require(plotly)
require(hierfstat)
require(adegenet)
require(PopGenReport)
require(pheatmap)
require(tidyverse)
require(DiagrammeR)
require(poppr)
require(genepop)
require(graph4lg)
require(knitr)
require(pegas)
require(cowplot)
This is an rstudio project. If you’d like to pre-rendered figures, read a summary of analysis and view code, please open the html file in a browser.
To conduct the analyses on your computer, edit or run code: clone this repository into a directory on you r local machine and open the .Rproj file in Rstudio. All data and analyses are available in the github repository at https://github.com/david-dayan/rogue_half_pounder.git
In addition to winter early-summer and late-summer runs, some steelhead on the Rogue express a “half-pounder” life-history. Half-pounders exhibit a false-spawning run (although some precocious males may spawn) as juveniles during the summer months, then reutrn to sea to continue the marine growth phase. There is great interest in half-pounders from a management perspective, because (i) half pounder abundance should integrate juvenile freshwater and early ocean conditions for steelhead regardless of whether they express the half pounder phenotype (ii) half pounder abundance is predictive of steelhead abundance in historical dam passage counts, and (iii) half-pounders are a unique fishery of great cultural and economic value. However, the relative proportion of winter vs early summer vs late-summer run life histories expressed by half pounders later in life is unknown. Furthermore there is limited genetic information about the three adult runs of steelhead on the Rogue.
This study attempts to use neutral and adaptive GTseq markers to examine population structure and run timing among Rogue River steelhead. We also attempt to classify half-pounders into run-timing groups on the basis of run-timing associated genetic markers.
Half Pounders
Samples were collected during ODFW’s Lower Rogue Seining Project in 2018 and 2019. The Lower Rogue Seining Project estimates escapement for Coho, late-summer run O. mykiss, half-pounder O. mykiss and fall chinook by beach seining near Huntley Park at approximate river mile 8, three times weekly from July through October. Half-pounder O. mykiss were identified as individuals with fork length 250 - 410mm and sampled in batches of up to 50 fish each day for 11 days from September 7th to October 1st 2018 totaling 384 individuals and 18 days from August 14th to September 25th 2019 totaling 331 individuals. Caudal fin clips were taken for DNA extraction, and placed in daily batch vials containing 95% ethanol. Note that due to batch collection of fin clips, that these numbers are inflated (some individuals have multiple tissue samples - identified later and filtered)
All half-pounders are non-marked and assumed to be natural origin fish.
Also ~5% of samples represented twice in GTseq library as QAQC samples
Adults
For brevity and easy code comprehension, throughout the notebook early-summer run steelhead are referred to as summer run and late-summer run steelhead are referred to as fall run.
45 winter and 45 early-run summer run fish. Adult summer steelhead were sampled at the Cole River Hatchery sorting pond on 6/26/2019 and (b) adult winter steelhead were sampled at Cole Rivers Hatchery (Rogue River) and the Applegate River from adult brood stock for 2019. Finally 166 additional late-summer run fish (fall run) were sampled at the Huntley Park seine on the lower Rogue.
All but 4 early-summer run adults are marked and assumed to hatchery origin fish. The final filtered dataset contained 1 natural origin and 41 hatchery origin fish. Winter run fish include NOR and HOR fish. After filtering the final dataset contains 18 HOR and 22 NOR fish. Early summer run fish are all unmarked and assumed NOR. All late-summer fish were unmarked and assumed NOR.
Information about sequencing data, genotype calling and filtering is available in the relevant R notebook (half_pounder_genotyping_notebook.Rmd) in this repository.
load("./genotype_data/genind_2.1.R")
load("./genotype_data/genotypes_2.3.R")
genos_2.3 <- ungroup(genos_2.3)
kable(genos_2.3 %>%
group_by(run, year) %>%
tally(), caption = "Final Filtered Dataset")
| run | year | n |
|---|---|---|
| fall | 2019 | 157 |
| halfpounder | 2018 | 338 |
| halfpounder | 2019 | 305 |
| Summer | 2019 | 42 |
| Winter | 2019 | 40 |
#save a file with this info
progeny <- readxl::read_xlsx("../meta_data/2019_summer/Omy Rogue2019 steelhead datasheets.xlsx", sheet = 4)
progeny <- progeny %>%
select(SFGL_ID, Origin)
select(genos_2.3, Sample, run, year, Date) %>%
left_join(progeny, by = c("Sample" = "SFGL_ID")) %>%
mutate(Origin = replace(Origin, Origin == "AD", "HOR")) %>%
mutate(Origin = replace(Origin, Origin == "1", "NOR")) %>%
replace_na(list(Origin = "NOR")) %>%
write_tsv("sample_metadata.txt")#convert AD (HOR) and 1 (NOR)
Throughout the notebook we refer to subsets of markers based on annotations provided by CRITFC. To my knowledge these annotations come from decisions made when the panel was assembled and edited (i.e. a run timing associated marker came out of a GWAS fro run timing). Broadly we examine 4 subsets of markers:
The table below demonstrates the annotations
marker_mapping <- readxl::read_xlsx("./metadata/final_mapping_results.xlsx", sheet = 1)
marker_summary <- marker_mapping %>%
mutate(marker = str_replace(marker, "Omy(\\d+)", "Chr\\1")) %>%
filter(marker %in% colnames(genos_2.3)) %>%
mutate(neutral = if_else(str_detect(`Presumed Type`, 'neutral|Neutral'), "neutral", "adaptive")) %>%
mutate(run_timing = if_else(str_detect(`Presumed Type`, 'run|Run'), "run_timing", "non-run_timing")) %>%
dplyr::select(marker, chr, start, 'Presumed Type', neutral, run_timing, Source)
DT::datatable(marker_summary, options = list(pageLength=10))
Our primary goals:
(a) examine population structure within and among the three run timing groups in the Rogue River (early summer, late summer and winter runs) (late summer referred to as fall in notebook)
(b) classify half-pounders into run timing group.
To begin, we will collect nucleotide diversity index at the marker and population levels, examine population structure at neutral and adaptive markers using multivariate and bayesian approaches, and assign half pounders to run timing groups based on a genetic classifier.
Here we calculate both per-marker and per-population diversity metrics (Ho He HWE Fis and Fst)
# first Ho and He
n.pop <- seppop(genind_2.1)
hobs <- lapply(n.pop, function(x) (summary(x)$Hobs))
hobs <- as.data.frame(t(do.call(rbind, hobs)))
hobs <- hobs %>%
rownames_to_column(var="marker")
hobs <- hobs %>%
pivot_longer(-marker, names_to = "run", values_to = "Ho")
#ggplot(hobs)+geom_boxplot(aes(x=pop, y=hobs))+theme_classic()+xlab("Run Timing")+ylab("Observed Heterozygosity")
hexp <- lapply(n.pop, function(x) (summary(x)$Hexp))
hexp <- as.data.frame(t(do.call(rbind, hexp)))
hexp <- hexp %>%
rownames_to_column(var="marker") %>%
pivot_longer(-marker, names_to = "run", values_to = "He")
marker_divs <- hexp %>%
left_join(hobs)
#now lets add the annotation status (neutral vs adaptive)
marker_mapping <- readxl::read_xlsx("./metadata/final_mapping_results.xlsx", sheet = 1)
marker_divs <- marker_mapping %>%
select(marker, `Presumed Type`) %>%
mutate(marker = str_replace(marker, "Omy(\\d+)", "Chr\\1")) %>% #marker name convention is different
mutate(neutral = if_else(str_detect(`Presumed Type`, 'Adaptive|adaptive'), "adaptive", "neutral")) %>%
right_join(marker_divs)
# lets add known run-timing marker as a grouping variable, and conver to longer format
marker_divs <- marker_divs %>%
mutate(run_timing_marker = if_else(str_detect(`Presumed Type`, 'Run|run'), "run" , "non-run")) %>%
pivot_longer(c("Ho", "He"), names_to = "obs_exp", values_to ="H")
#nice, now lets make a long table
# lets throw up some nice plots here
# first for all markers
ggplot(data=marker_divs)+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("all markers")
ggplot(data=marker_divs[marker_divs$neutral=="neutral",])+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("neutral markers")
ggplot(data=marker_divs[marker_divs$neutral=="adaptive",])+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("adaptive markers")
ggplot(data=marker_divs[marker_divs$run_timing_marker=="run",])+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("run-timing markers")
#publication plot fig. 1
#ggplot(data=drop_na(marker_divs[marker_divs$run_timing_marker=="run",]))+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d(labels = c(expression(H[E]), expression(H[O])), name = " ")+theme_classic()+xlab("")+scale_x_discrete(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))
#publication plot suppl fig. 1
# a <- ggplot(data=drop_na(marker_divs[marker_divs$neutral=="neutral",]))+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d(labels = c(expression(H[E]), expression(H[O])), name = " ")+theme_classic()+xlab("")+scale_x_discrete(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+theme(legend.position = "none")
#
# b <- ggplot(data=drop_na(marker_divs[marker_divs$neutral=="adaptive",]))+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d(labels = c(expression(H[E]), expression(H[O])), name = " ")+theme_classic()+xlab("")+scale_x_discrete(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+ylim(0,0.68)
#
# plot_grid(a,b, rel_widths = c(0.83,1), labels = c("(a)", "(b)"))
#supplemental table
marker_divs %>%
group_by(neutral, run, obs_exp) %>%
summarise(mean = mean(H)) %>%
print(n = 24)
## # A tibble: 24 x 4
## # Groups: neutral, run [12]
## neutral run obs_exp mean
## <chr> <chr> <chr> <dbl>
## 1 adaptive fall He 0.250
## 2 adaptive fall Ho 0.244
## 3 adaptive halfpounder He 0.255
## 4 adaptive halfpounder Ho 0.235
## 5 adaptive Summer He 0.219
## 6 adaptive Summer Ho 0.223
## 7 adaptive Winter He 0.230
## 8 adaptive Winter Ho 0.225
## 9 neutral fall He 0.308
## 10 neutral fall Ho 0.302
## 11 neutral halfpounder He 0.310
## 12 neutral halfpounder Ho 0.300
## 13 neutral Summer He 0.304
## 14 neutral Summer Ho 0.297
## 15 neutral Winter He 0.311
## 16 neutral Winter Ho 0.308
## 17 <NA> fall He 0.333
## 18 <NA> fall Ho 0.350
## 19 <NA> halfpounder He 0.334
## 20 <NA> halfpounder Ho 0.329
## 21 <NA> Summer He 0.350
## 22 <NA> Summer Ho 0.402
## 23 <NA> Winter He 0.358
## 24 <NA> Winter Ho 0.321
marker_divs %>%
group_by(run_timing_marker, run, obs_exp) %>%
summarise(mean = mean(H)) %>%
print(n = 24)
## # A tibble: 24 x 4
## # Groups: run_timing_marker, run [12]
## run_timing_marker run obs_exp mean
## <chr> <chr> <chr> <dbl>
## 1 non-run fall He 0.287
## 2 non-run fall Ho 0.278
## 3 non-run halfpounder He 0.289
## 4 non-run halfpounder Ho 0.279
## 5 non-run Summer He 0.286
## 6 non-run Summer Ho 0.282
## 7 non-run Winter He 0.291
## 8 non-run Winter Ho 0.287
## 9 run fall He 0.358
## 10 run fall Ho 0.390
## 11 run halfpounder He 0.357
## 12 run halfpounder Ho 0.266
## 13 run Summer He 0.00933
## 14 run Summer Ho 0.00992
## 15 run Winter He 0.0960
## 16 run Winter Ho 0.102
## 17 <NA> fall He 0.333
## 18 <NA> fall Ho 0.350
## 19 <NA> halfpounder He 0.334
## 20 <NA> halfpounder Ho 0.329
## 21 <NA> Summer He 0.350
## 22 <NA> Summer Ho 0.402
## 23 <NA> Winter He 0.358
## 24 <NA> Winter Ho 0.321
Are these differences between populations significant? What about differences between nuetral and adaptive sets of loci. To test with the same number of loci (across populations) we’ll use a monte-carlo test and 999 permutations. To test across different sets of loci (neutral vs adaptive), we’ll use a simple t-test
#lets make a way to easily called different snp classes
#"neutral" markers
neutral_loci_names <- marker_mapping %>%
filter(str_detect(`Presumed Type`, 'neutral|Neutral')) %>%
dplyr::select(marker)
#different naming convention, lets fix
neutral_loci_names <- str_replace(neutral_loci_names$marker, "Omy28", "Chr28")
#run timing markers
run_timing_loci_names <- marker_mapping %>%
filter(chr == "28" | CRITFC_chromosome == "28") %>%
filter(str_detect(`Presumed Type`, 'run|Run')) %>%
select(marker)
#different naming convention, lets fix
run_timing_loci_names <- str_replace(run_timing_loci_names$marker, "Omy28", "Chr28")
## test for all pairwise differences at run-timing markers
#make a pop separted genind at run timing markers
chr28_seppop <- seppop(genind_2.1[loc=run_timing_loci_names])
# fall-half
a <- Hs.test(chr28_seppop$fall, chr28_seppop$halfpounder)
#fall-summer
b <- Hs.test(chr28_seppop$fall, chr28_seppop$Summer)
#fall-winter
c <- Hs.test(chr28_seppop$fall, chr28_seppop$Winter)
#half-summer
d <- Hs.test(chr28_seppop$halfpounder, chr28_seppop$Summer)
#half-winter
e <- Hs.test(chr28_seppop$halfpounder, chr28_seppop$Winter)
#summer-winter
f <- Hs.test(chr28_seppop$Summer, chr28_seppop$Winter)
#now make a nice table
df <- as.data.frame(cbind(c("fall", "fall", "fall", "half", "half", "summer"), c("half", "summer", "winter", "summer", "winter", "winter"), c(a$pvalue,b$pvalue,c$pvalue,d$pvalue,e$pvalue,f$pvalue)))
colnames(df) <- c("pop1", "pop2","p-value")
kable(df, caption = "Monte-Carlo p-value for difference in observed Heterozygosity at run-timing markers")
| pop1 | pop2 | p-value |
|---|---|---|
| fall | half | 0.998 |
| fall | summer | 0.001 |
| fall | winter | 0.001 |
| half | summer | 0.001 |
| half | winter | 0.001 |
| summer | winter | 0.001 |
## Next lets look if there is increased/decreased diversity within a population at these markers compared to neutral
neutral_seppop <- seppop(genind_2.1[loc=neutral_loci_names])
#fall
a <- t.test(summary(chr28_seppop$fall)$Hexp, summary(neutral_seppop$fall)$Hexp, alternative = "greater")
#half
b <- t.test(summary(chr28_seppop$halfpounder)$Hexp, summary(neutral_seppop$halfpounder)$Hexp, alternative = "less")
#summer
c <- t.test(summary(chr28_seppop$Summer)$Hexp, summary(neutral_seppop$Summer)$Hexp, alternative = "less")
#winter
d <- t.test(summary(chr28_seppop$Winter)$Hexp, summary(neutral_seppop$Winter)$Hexp, , alternative = "less")
#now make a nice table
df <- as.data.frame(cbind(c("fall", "half", "summer", "winter"),c(a$p.value,b$p.value,c$p.value,d$p.value)))
colnames(df) <- c("pop", "p-val")
kable(df, caption = "T-test p-value for difference He at run timing markers vs neutral markers")
| pop | p-val |
|---|---|
| fall | 0.208471063569454 |
| half | 0.79818162732174 |
| summer | 5.65887777760058e-28 |
| winter | 0.000149223818960633 |
All pairwise population comparisons of He at chr28 are significantly different (p=0.001) EXCEPT fall-halfpounder. Winter and Summer populations demonstrate significantly reduced He at chr28 markers relative to neutral markers.
Are there significant departures from HWE at the loci level?
# here we use the hw.test function from pegas (exact test based on Monte Carlo permutations of alleles, 1000 permutations)
HWE.test <- lapply(n.pop, function(x) hw.test(x, B=1000))
# here we take the list of dataframes of p-values and combine into a single dataframe
hwe <- reduce(HWE.test, cbind)
hwe <- hwe[,c(4,8,12,16)]
colnames(hwe) <- c("fall", "halfpounder", "summer", "winter")
hwe <- as.data.frame(hwe)
# next we correct for multiple comparisons
p.adj <- as.data.frame(apply(hwe, MARGIN = 2, function(x) p.adjust(x, "fdr")))
hwe_exceed <- p.adj %>% rownames_to_column(var="marker") %>%
pivot_longer(-marker, names_to = "run", values_to = "fdr") %>%
filter(fdr < 0.1)
hwe_exceed <- hwe_exceed %>%
left_join(pivot_wider(marker_divs, names_from = obs_exp, values_from = H)) %>%
mutate(direction = if_else(He > Ho, "excess_homo", "excess_hetero")) %>%
filter(direction == "excess_homo")
a <- hwe_exceed%>%
group_by(run) %>%
tally()
kable(a, caption = "Number of markers significantly out of HWE")
| run | n |
|---|---|
| fall | 11 |
| halfpounder | 65 |
Yes, see table above for number of SNPs outside of HWE (fdr < 0.1) per “population”
#lets check if there is an enrichment of run-timing and adaptive markers in markers out of HWE in fall and half-pounders
#lets get marker info, but only for markers in the panel
marker_mapping2 <- marker_mapping %>%
mutate(marker = str_replace(marker, "Omy(\\d+)", "Chr\\1")) %>%
filter(marker %in% colnames(genos_2.3))
# halfpunder enriched for run timing markers
a <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$run_timing_marker=="run",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$run_timing_marker!="run",])
d <- nrow(marker_mapping2)-b
hr <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")
#fall for run-timing markers
a <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$run_timing_marker=="run",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$run_timing_marker!="run",])
d <- nrow(marker_mapping2)-b
fr <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")
#halfpounder for adaptive markers
a <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$neutral=="adaptive",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$neutral!="adaptive",])
d <- nrow(marker_mapping2)-b
ha <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")
#fall for adaptive markers
a <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$neutral=="adaptive",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$neutral!="adaptive",])
d <- nrow(marker_mapping2)-b
fa <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")
# summer enriched for run timing markers
a <- nrow(hwe_exceed[hwe_exceed$run=="Summer" & hwe_exceed$run_timing_marker=="run",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="Summer" & hwe_exceed$run_timing_marker!="run",])
d <- nrow(marker_mapping2)-b
sr <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")
#make a nice table
df <- as.data.frame(cbind(c("fall", "fall", "half", "half"),c("run", "all adaptive", "run", "all adaptive"),c(fr$p.value, fa$p.value, hr$p.value, ha$p.value), c(fr$estimate, fa$estimate, hr$estimate, ha$estimate)))
colnames(df) <- c("pop", "marker type", "p-val", "fold-enrichment")
kable(df, caption = "enrichment (fisher's exact test) of marker types among markers out of HWE")
| pop | marker type | p-val | fold-enrichment | |
|---|---|---|---|---|
| odds.ratio | fall | run | 0.000451154032436204 | 0 |
| odds.ratio.1 | fall | all adaptive | 0.888555009972302 | 1.76377506668203 |
| odds.ratio.2 | half | run | 6.75878160057321e-10 | 0.155002415327548 |
| odds.ratio.3 | half | all adaptive | 0.259137860680439 | 0.81808803713288 |
Markers significantly out of HWE (Ho < He: excess of homozygotes) are enriched for run-timing markers, but not adaptive markers in the fall and half-pounder groups.
Now let’s make a nice visual representation of this information (fall and half pounder demonstrate a dearth of heterozygotes at run ting markers)
plot_data_half <- marker_divs %>%
filter(run == "halfpounder") %>%
pivot_wider( names_from = obs_exp, values_from = H)
ggplot(plot_data_half)+geom_point(aes(He, Ho, color = run_timing_marker))+geom_abline(aes(intercept=0, slope=1))+coord_equal(ratio=1)+ylim(0,0.6)+scale_color_viridis_d()+theme_classic()+ggtitle("halfpounder")
plot_data_fall <- marker_divs %>%
filter(run == "fall") %>%
pivot_wider( names_from = obs_exp, values_from = H)
ggplot(plot_data_fall)+geom_point(aes(He, Ho, color = run_timing_marker))+geom_abline(aes(intercept=0, slope=1))+coord_equal(ratio=1)+ylim(0,0.6)+scale_color_viridis_d()+theme_classic()+ggtitle("fall")
It’s surprising that there is enrichment of run timing markers among markers out of HWE in the fall run from these figures, but double checked the code, still significant enriched… Meanwhile the results in halfpounders are very clear. Wait! the hwe test was two sided, the enrichment of run-timing markers in the fall run is actually due a marker with MORE heterozygosity than expected.
At neutral markers, there is a slight reduction of observed from expected heterozygosity, most pronounced among summer run fish. At run timing markers, there is a significant reduction in diversity (Ho) relative to neutral markers among the summer run and winter run fish (one-sided t-test), and increased diversity at fall run and half-pounders (not significant). Furthermore, there is an excess of homozygosity at run timing associated SNPs in the half-pounders (but not fall fish) suggesting some structure within this group. Among markers with significant departures from HWE within fall and half-pounders, there is a significant enrichment of run-timing associated markers ( p-value = 0.0016 and p-value = 7.88e-12, odds ratios 12.4 and 8.0, respectively fisher’s exact test), but not significant enrichment for markers annotated as adaptive overall.
Let’s move on to f-statistics. We’ll use hierfstats for most of this work, so the first step is to convert to a hierfstat format. Then we’ll calculate some basic statists and Fst (Nei)
fstat <- genind2hierfstat(genind_2.1)
colnames(fstat) <- c(pop, names(genind_2.1$loc.n.all))
# and also lets make datasets at different sets of loci, because hierfstat doesn't easily retain loci names for later use
fstat_neutral <- genind2hierfstat(genind_2.1[loc=neutral_loci_names])
colnames(fstat_neutral) <- c(pop, names(genind_2.1[loc=neutral_loci_names]$loc.n.all))
fstat_run_timing <- genind2hierfstat(genind_2.1[loc=run_timing_loci_names])
colnames(fstat_run_timing) <- c(pop, names(genind_2.1[loc=run_timing_loci_names]$loc.n.all))
#calculate datset wide basic stats
basicstats <- basic.stats(fstat)
basicstats_neutral <- basic.stats(fstat_neutral)
basicstats_run_timing <- basic.stats(fstat_run_timing)
kable(basicstats$overall, caption = "All markers Fstats")
| x | |
|---|---|
| Ho | 0.2808 |
| Hs | 0.2891 |
| Ht | 0.2963 |
| Dst | 0.0072 |
| Htp | 0.2987 |
| Dstp | 0.0096 |
| Fst | 0.0243 |
| Fstp | 0.0321 |
| Fis | 0.0287 |
| Dest | 0.0135 |
kable(basicstats_neutral$overall, caption = "Neutral marker Fstats")
| x | |
|---|---|
| Ho | 0.3045 |
| Hs | 0.3135 |
| Ht | 0.3149 |
| Dst | 0.0014 |
| Htp | 0.3154 |
| Dstp | 0.0018 |
| Fst | 0.0043 |
| Fstp | 0.0058 |
| Fis | 0.0288 |
| Dest | 0.0026 |
kable(basicstats_run_timing$overall, caption = "Run Timing Marker Fstats")
| x | |
|---|---|
| Ho | 0.1919 |
| Hs | 0.2066 |
| Ht | 0.3743 |
| Dst | 0.1677 |
| Htp | 0.4302 |
| Dstp | 0.2236 |
| Fst | 0.4480 |
| Fstp | 0.5198 |
| Fis | 0.0709 |
| Dest | 0.2818 |
Now let’s look at the distribution of Fst and check if markers with certain annotations are enriched among outliers.
basicstats$perloc$run_timing <- if_else(rownames(basicstats$perloc) %in% run_timing_loci_names, "run", "non-run")
basicstats$perloc$neutral <- if_else(rownames(basicstats$perloc) %in% neutral_loci_names, "neutral", "adaptive")
ggplot(basicstats$perloc)+geom_histogram(aes(x=Fst, fill=neutral))+theme_classic()
ggplot(basicstats$perloc)+geom_histogram(aes(x=Fst, fill=run_timing))+theme_classic()
#check if any non-run timing markers have high Fst
#max((basicstats$perloc[basicstats$perloc$run_timing=="non-run",])$Fst, na.rm = TRUE)
Yes, there are some clear fst outlier loci in the dataset. These outliers are composed solely of known run-timing markers, but not all run-timing markers demostrate high differentiation. The maximum fst on a non-run-timing marker was 0.0486
Let’s collect genetic distance info on pairs of pops as well (Weir and Cochran 1984).
First, the full dataset:
genet.dist(fstat, method="WC84")
## fall halfpounder Summer
## halfpounder 0.001194555
## Summer 0.036685102 0.036916689
## Winter 0.018712070 0.013652717 0.081269336
Then just neutral loci:
genet.dist(fstat_neutral, method="WC84")
## fall halfpounder Summer
## halfpounder 0.000193948
## Summer 0.009047995 0.008312525
## Winter 0.003447739 0.002929805 0.010647062
Let’s split the halfpounders into year classes and run again
genind_year <- genind_2.1
genind_year@pop <- as.factor(paste(genos_2.3$run, genos_2.3$year, sep = "_"))
fstat_year <- genind2hierfstat(genind_year)
colnames(fstat_year) <- c(pop, names(genind_year$loc.n.all))
# and also lets make datasets at different sets of loci, because hierfstat doesn't easily retain loci names for later use
fstat_neutral_year <- genind2hierfstat(genind_year[loc=neutral_loci_names])
## Warning: the following specified loci do not exist: Omy_BAMBI2.312,
## Omy_118205-116, Omy_G3PD_2.246, Omy_u09-53.469, Omy_u09-61.043, Omy_pad-196,
## Omy_97865-196, M09AAE.082, Omy_ftzf1-217, Omy_BAMBI4.238, Omy_BAC-F5.284,
## OMS00074, Omy_ndk-152, Omy_Ogo4-212, M09AAJ.163, OMS00018, Omy_u09-56.119,
## M09AAD.076, Omy_Il-1b_.028, M09AAC.055, Omy_104569-114, OMS00173, Omy_impa1-55,
## Omy_u09-52.284, Omy_gsdf-291
colnames(fstat_neutral_year) <- c(pop, names(genind_year[loc=neutral_loci_names]$loc.n.all))
## Warning: the following specified loci do not exist: Omy_BAMBI2.312,
## Omy_118205-116, Omy_G3PD_2.246, Omy_u09-53.469, Omy_u09-61.043, Omy_pad-196,
## Omy_97865-196, M09AAE.082, Omy_ftzf1-217, Omy_BAMBI4.238, Omy_BAC-F5.284,
## OMS00074, Omy_ndk-152, Omy_Ogo4-212, M09AAJ.163, OMS00018, Omy_u09-56.119,
## M09AAD.076, Omy_Il-1b_.028, M09AAC.055, Omy_104569-114, OMS00173, Omy_impa1-55,
## Omy_u09-52.284, Omy_gsdf-291
genet.dist(fstat_year, method="WC84")
## fall_2019 halfpounder_2018 halfpounder_2019 Summer_2019
## halfpounder_2018 0.001969572
## halfpounder_2019 0.001044190 0.001300028
## Summer_2019 0.036685102 0.040940216 0.033264677
## Winter_2019 0.018712070 0.010852483 0.017463481 0.081269336
genet.dist(fstat_neutral_year, method="WC84")
## fall_2019 halfpounder_2018 halfpounder_2019 Summer_2019
## halfpounder_2018 0.0004871133
## halfpounder_2019 0.0002725602 0.0007315691
## Summer_2019 0.0090479952 0.0083556453 0.0086611756
## Winter_2019 0.0034477391 0.0023854085 0.0038966945 0.0106470621
Next let’s test these Fst results to see if they are significant
To accomplish this we’ll randomize over populations 1000 fold and then conduct a monte-carlo test. Note this approach is very computationally intensive, should find another one if we really want a significance test on Fst, skipped for now.
mat.obs <- pairwise.WCfst(as.data.frame(cbind((genind_2.1$pop),genind_2.1$tab)))
NBPERM <- 100 # this is the number of permutations used for the p-values; for a publication at least 999 would be required.
mat.perm <- lapply(1:NBPERM, function(i) pairwise.WCfst(as.data.frame(cbind(sample(genind_2.1$pop, size = 882),genind_2.1$tab))))
allTests <- list()
for(i in 1:(nrow(mat.obs)-1)){
for(j in 2:nrow(mat.obs)){
allTests[[paste(rownames(mat.obs)[i],rownames(mat.obs)[j],sep="-")]] <- as.randtest(na.omit(sapply(1:NBPERM, function(k) mat.perm[[k]][i,j])), mat.obs[i,j], alter="greater")
}
}
Moderate differentiation between most pairwise pop comps (not fall and halfpounder), but as suspected, this is driven mostly by the run timing markers. When examining only neutral annotated loci, differentiation is extremely low (less than 1%). Also of note here is that both the neutral and total estimates of differentiation between early and late summer (fall) populations (XX% and XX%, respectively) is quite high. Mean Fst over all loci is about half the level of differentation between early summer and winter populations (XX%) and nearly the same when examining only neutral loci (XX%). Indeed, the late summer run fish demonstrate lower differentiation from winter run than early summer run fish at both neutral and all loci.
From the Fst estimation in the section above we have our first ideas about population structure: (a) fall run and half pounder fish demonstrate little to no differentiation, and this group is less differentiated from winter run than summer run fish and (b) some evidence of structure WITHIN halfpounder and fall runs. In this section we will conduct several analyses to uncover population structure in greater detail.
Here we use a bayesian, model based clustering algorithm (STRUCTURE) to infer population structure and estimate admixture proportions of individual samples.
First we need to get our dataset ready for structure: remove linked loci, convert to structure format.
# first lets calculate LD (dartR has a great (fast) ld estimator that works right on genind files, so let's use this)
ldreport <- dartR::gl.report.ld(genind_2.1, name = NULL, save = FALSE, nchunks = 2, ncores = 3, chunkname = NULL)
## Start to calculate LD for all pairs of loci...
## Using 3 cores in 2 chunks.
## Depending on the number of loci this may take a while...
## nchunks specifies the number of steps in the progress bar and the number of intermediate saves, but slows the computation a bit. nchunks = 1 is fastest.
## Seperate all 882 loci...
## Generate all possible pairs: 62128 ...
## Calculate LD for all pairs...
##
|
|=================================== | 50%
|
|======================================================================| 100%
## # Simulations: 62128 . Took 37 seconds.
We’ll prune (keep one) the dataset of any locus-pairs with r2 > 0.2, then convert to STRUCTURE format
unlinked_genind <- genind_2.1[loc=-unique(ldreport[ldreport$R2>0.2,]$loc2)]
rm(ldreport)
#note just sort of crashed through this with a text editor, not easily logged, but the general idea was transpose the data, split columns (diploid to dual haploid) then convert data to integers
df <- genind2df(unlinked_genind)
df <- as.data.frame(t(df))
write_tsv(df, "genotype_data/all.str.tmp")
df <- read_tsv("genotype_data/all.str.tmp", col_names = FALSE)
df <- t(df)
write_tsv(as.data.frame(df), "genotype_data/all.str", col_names = FALSE)
Note that the final structure data file did not differ between when this run the first time (without removing the 2 redundant and missingness loci). So structure did not need to be run a second time. This removed 23 highly linked SNPs from the dataset. (25 the first time, but two of these were filtered out at an earlier stage in this run)
Structure was run in a GUI outside this computation notebook’s environment.
admixture model: admixture, with correlated allele frequency
burnin/mcmc: ran with k=1-3 for 100k iteration to check for convergence of alpha, strong convergence after a few hundred burn-in iterations, used 10,000/20,000 for final runs
replicates: did 10 replicates for k=1-6
Best K was chosen by the evanno method, and estimated in structure harvester.
Replicate results within each K were clumpp’d using the clumpak algorithm on the clumpak webserver
Here we visualize the structure results of the clumpp’d results of all K values
Best K
Best K was 2 according to the Evanno (delta K) method, however, it’s important to remember the bias toward k=2 when differentiation is low or there is no population structure using this method. Delta k literally cannot evaluate K=1. (see note below)
note: delta K suggests best k is two, however, particularly at low differentiation, the delta k method is biased towards k=2 (cullingham 2020), seems like a good place to remind myself that k is a model that doesn’t always fully catpure biological reality and comparing results across different levels of k can proide interesting insights, even when best k is unknown, particularly in the case of low differentiation. Also a note that it might be good to review the literature again on bayesian clustering methods for low levels of differentation, e.g. that gagnaire paper and latch 2006.
delta K plot
Structure Plots
Next let’s take the clumppd results and make some publication-ready figures. First plot is downsampled to 42 samples per “population”, second plot is full dataset.
#import clump results into dataframes
# results are in files k*/majorcluster/clumppfiles/clumpindoutput
# took these files and captured the relevant data with a text editor (original input files are a mess with multiple field separators) and saved to new files
k2 <- read_tsv("./structure/halfpounder/clumpak/formatted_results/k2.txt")
k3 <- read_tsv("./structure/halfpounder/clumpak/formatted_results/k3.txt")
k4 <- read_tsv("./structure/halfpounder/clumpak/formatted_results/k4.txt")
k5 <- read_tsv("./structure/halfpounder/clumpak/formatted_results/k5.txt")
plot_data <- k2 %>%
rownames_to_column(var="id") %>%
group_by(pop) %>%
sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
ungroup() %>%
gather('cluster', 'prob', clust1:clust2) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)],
assingment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assingment_prob)) %>%
ungroup()
a <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col(width=1.0) +
facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
scale_y_continuous(expand = c(0, 0)) +
scale_x_discrete(expand = expand_scale(add = 1)) +
theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
scale_fill_manual(values = viridisLite::viridis(2))
plot_data <- k3 %>%
rownames_to_column(var="id") %>%
group_by(pop) %>%
sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
ungroup() %>%
gather('cluster', 'prob', clust1:clust3) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)],
assingment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assingment_prob)) %>%
ungroup()
b <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col(width=1.0) +
facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
scale_y_continuous(expand = c(0, 0)) +
scale_x_discrete(expand = expand_scale(add = 1)) +
theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
scale_fill_manual(values = viridisLite::viridis(3))
plot_data <- k4 %>%
rownames_to_column(var="id") %>%
group_by(pop) %>%
sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
ungroup() %>%
gather('cluster', 'prob', clust1:clust4) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)],
assingment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assingment_prob)) %>%
ungroup()
c <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col(width=1.0) +
facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
scale_y_continuous(expand = c(0, 0)) +
scale_x_discrete(expand = expand_scale(add = 1)) +
theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
scale_fill_manual(values = viridisLite::viridis(4))
plot_data <- k5 %>%
rownames_to_column(var="id") %>%
group_by(pop) %>%
sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
ungroup() %>%
gather('cluster', 'prob', clust1:clust5) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)],
assingment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assingment_prob)) %>%
ungroup()
d <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col(width=1.0) +
facet_grid(~pop, scales = 'free', space = 'free', switch = "x", labeller = labeller( pop = c("fall" ="Late-Summer", "half"= "Half-Pounder", "summer" = "Early-Summer", "winter" = "Winter"))) +
scale_y_continuous(expand = c(0, 0)) +
scale_x_discrete(expand = expand_scale(add = 1)) +
theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white")) +
scale_fill_manual(values = viridisLite::viridis(5))
cowplot::plot_grid(a,b,c,d, ncol=1)
plot_data <- k2 %>%
rownames_to_column(var="id") %>%
# sample the half_pounder and fall fish to smaller size for plot
gather('cluster', 'prob', clust1:clust2) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)],
assingment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assingment_prob)) %>%
ungroup()
a <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col(width=1.0) +
facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
scale_y_continuous(expand = c(0, 0)) +
scale_x_discrete(expand = expand_scale(add = 1)) +
theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
scale_fill_manual(values = viridisLite::viridis(2))
plot_data <- k3 %>%
rownames_to_column(var="id") %>%
gather('cluster', 'prob', clust1:clust3) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)],
assingment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assingment_prob)) %>%
ungroup()
b <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col(width=1.0) +
facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
scale_y_continuous(expand = c(0, 0)) +
scale_x_discrete(expand = expand_scale(add = 1)) +
theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
scale_fill_manual(values = viridisLite::viridis(3))
plot_data <- k4 %>%
rownames_to_column(var="id") %>%
gather('cluster', 'prob', clust1:clust4) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)],
assingment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assingment_prob)) %>%
ungroup()
c <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col(width=1.0) +
facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
scale_y_continuous(expand = c(0, 0)) +
scale_x_discrete(expand = expand_scale(add = 1)) +
theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
scale_fill_manual(values = viridisLite::viridis(4))
plot_data <- k5 %>%
rownames_to_column(var="id") %>%
gather('cluster', 'prob', clust1:clust5) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)],
assingment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assingment_prob)) %>%
ungroup()
d <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col(width=1.0) +
facet_grid(~pop, scales = 'free', space = 'free', switch = "x", labeller = labeller( pop = c("fall" ="Late-Summer", "half"= "Half-Pounder", "summer" = "Early-Summer", "winter" = "Winter"))) +
scale_y_continuous(expand = c(0, 0)) +
scale_x_discrete(expand = expand_scale(add = 1)) +
theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_text(angle = 90)) +
scale_fill_manual(values = viridisLite::viridis(5))
cowplot::plot_grid(a,b,c,d, rel_heights = c(1,1,1,1.5) ,ncol=1)
STRUCTURE Results Summary
Below we perform an unconstrained ordination of genetic data for both the neutral and the full datasets using PCA (implemented in ade4 and adegenet)
first we run pca only on the neutral dataset
neutral_genind <- genind_2.1[loc=neutral_loci_names]
#rename pops for mpublication quality figure
neutral_genind$pop <- as.factor(str_replace(neutral_genind$pop, "Summer", "Early-Summer"))
neutral_genind$pop <- as.factor(str_replace(neutral_genind$pop, "fall", "Late-Summer"))
neutral_genind$pop <- as.factor(str_replace(neutral_genind$pop, "halfpounder", "Half-Pounder"))
#reorder pop factor
neutral_genind$pop <- factor(neutral_genind$pop, levels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))
#replace missing data using mean allele frequency
neutral_genind_scale <- scaleGen(neutral_genind, NA.method="mean")
# run the PCA, dudi.pca uses an interactive prompt to choose pcs to retain, we retain all in order to run some analyses on eigenvalues
pca1 <- dudi.pca(neutral_genind_scale,cent=FALSE,scale=FALSE, scannf = FALSE, nf = 332)
barplot(pca1$eig[1:332],main="PCA eigenvalues")
s.class(pca1$li, pop(neutral_genind),xax=1,yax=2, col=transp(viridisLite::viridis(4),0.7), axesell=FALSE, cstar=0, cpoint=1, grid=FALSE)
add.scatter.eig(pca1$eig[1:10],nf=3,xax=1,yax=2, posi = "bottomright" )
#publication quality figure
# pca_vals <- pca1$li[,c(1:2)]
# pca_vals <- pca_vals %>%
# rownames_to_column(var = "Sample") %>%
# left_join(select(genos_2.3, Sample, run)) %>%
# mutate(run = if_else(run == "fall", "Late-Summer", run)) %>%
# mutate(run = if_else(run == "halfpounder", "Half-Pounder", run)) %>%
# mutate(run = if_else(run == "Summer", "Early-Summer", run))
#
# ggplot(data = pca_vals, aes(Axis1, Axis2, color = run))+geom_point(alpha = 0.5)+stat_ellipse()+theme_classic()+scale_color_viridis_d()+xlab("Principal Component 1")+ylab("Principal Component 2")
# barplot(pca1$eig[1:10],main="PCA eigenvalues")
Let’s also make an interactive 3d plot, since the third PC is not much less informative than the second
plot_ly(x=pca1$li$Axis1, y=pca1$li$Axis2, z=pca1$li$Axis3, type="scatter3d", mode="markers", color=neutral_genind$pop, alpha = 0.8)
I’m interactive!!!
As suggested by our Fst results, there is little differentiation at neutral markers. Also of note is a halfpounder outlier along the first pc. It should not come as a surprise that pca fails to differentiate among populations given the fst. PCA should not be able to differentiate among pops when fst < 1/(sqrt(sample size X number of markers)), which in our case is about 3%, substantially greater than the fst calculated for neutral markers (patterson 2006).
Also note that the primary axis (1 / X) captures variation within fall and half pounedrs that is largely absent from winter and summer runs. Once again, another hint of population structure within halfpounders/fall run fish, but this time suggestive of neutral variation that doesn’t exist within the fall/winter runs.
#replace missing data using mean allele frequency
genind_scale <- scaleGen(genind_2.1, NA.method="mean")
genind_2.3 <- genind_2.1
#rename pops for mpublication quality figure
genind_2.3$pop <- as.factor(str_replace(genind_2.3$pop, "Summer", "Early-Summer"))
genind_2.3$pop <- as.factor(str_replace(genind_2.3$pop, "fall", "Late-Summer"))
genind_2.3$pop <- as.factor(str_replace(genind_2.3$pop, "halfpounder", "Half-Pounder"))
#reorder pop factor
genind_2.3$pop <- factor(genind_2.3$pop, levels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))
# run the PCA, dudi.pca uses an interactive prompt to choose pcs to retain, we retain all in order to run some analyses on eigenvalues
pca1 <- dudi.pca(genind_scale,cent=FALSE,scale=FALSE, scannf = FALSE, nf = 353)
barplot(pca1$eig[1:353],main="PCA eigenvalues")
s.class(pca1$li, pop(genind_2.3),xax=1,yax=2, col=transp(viridisLite::viridis(4),0.7), axesell=FALSE, cstar=0, cpoint=1, grid=FALSE)
add.scatter.eig(pca1$eig[1:10],nf=3,xax=1,yax=2)
Let’s also make an interactive 3d plot, since the third PC is not much less informative than the second
plot_ly(x=pca1$li$Axis1, y=pca1$li$Axis2, z=pca1$li$Axis3, type="scatter3d", mode="markers", color=genind_2.3$pop, alpha = 0.8)
In the first three most important axes of genetic differentiation using the full dataset we observe complete overlap of fall and halfpounder fish, winter and summer fish are clearly separated, but the cloud of fall-halfpounder fish is inclusive of both of these groups. Three clusters of data are present in the data, (1) a winter-like cluster also including some fall run and half pounders, (2) a summer-like cluster also including some fall and half pounder fush and (3) a uniquely half-pounder fall run cluster positioned intermediate between the two.
Also, just like the neutral PCA, this captures some variation unique to fall/halfpounders. This time this variance is pushed back to the third PC, instead of the first.
Here we pull a little harder to find some population structure. We conduct a discriminant analysis on PCA transformed genetic variation (DAPC).
First we will use k-means clustering to infer the number of genetic clusters in the dataset without applying a priori assumptions about the number of clusters or population assignments in the sample datset. Then we will check if these groups match well with assigned pops before proceeding to the DAPC.
Below are plots of bayesian information criterion across different numbers of clusters and cluster membership across a range of most-likely clusters:
#k means clustering, keep a lot (330 pcs) (kmeans won't overfit with two many pcs)
#note the number of clusters was chosen interactively, the code below executes the clustering using the best k
clusts <- find.clusters(genind_2.3, n.pca = 330, choose.n.clust = FALSE)
#plot BIC
bic <- as.data.frame(cbind(c(1:length(clusts$Kstat)), clusts$Kstat))
ggplot(bic)+geom_point(aes(x=V1, y=V2))+geom_line(aes(x=V1, y=V2))+theme_classic()+xlim(c(0, 50))+xlab("k")+ylab("BIC")
BIC for k-means clustering across different numbers of genetic clusters
clusts <- find.clusters(genind_2.3, n.pca = 330, n.clust = 2)
kable(table(pop(genind_2.3), clusts$grp),caption = "a priori population vs genetic cluster" )
| 1 | 2 | |
|---|---|---|
| Late-Summer | 86 | 71 |
| Half-Pounder | 301 | 342 |
| Early-Summer | 42 | 0 |
| Winter | 0 | 40 |
clusts <- find.clusters(genind_2.3, n.pca = 330, n.clust = 3)
kable(table(pop(genind_2.3), clusts$grp),caption = "a priori population vs genetic cluster" )
| 1 | 2 | 3 | |
|---|---|---|---|
| Late-Summer | 68 | 44 | 45 |
| Half-Pounder | 170 | 212 | 261 |
| Early-Summer | 0 | 42 | 0 |
| Winter | 7 | 0 | 33 |
clusts <- find.clusters(genind_2.3, n.pca = 330, n.clust = 4)
kable(table(pop(genind_2.3), clusts$grp), caption = "a priori population vs genetic cluster" )
| 1 | 2 | 3 | 4 | |
|---|---|---|---|---|
| Late-Summer | 49 | 24 | 67 | 17 |
| Half-Pounder | 175 | 199 | 172 | 97 |
| Early-Summer | 7 | 0 | 0 | 35 |
| Winter | 0 | 33 | 7 | 0 |
The model fit is best at k=4. BIC tends to decrease until best fit only in a perfect island model, in practice, best K is often found at the lowest K that is a substantial improvement from the last. Here we use k=4, but when other k were used, generally the same result: Never any overlap between winter and summer run fish, fall and halfpounders always distributed among the 4 groups.
This result fits with our previous findings, there is structure and high diversity within fall and halfpounders (see heterozygosity section) and substantial overlap between fall-halfpounders and both winter and summer fish, but not between winter and summer fish.
This creates a complex scenario for conducting the DAPC, should we use a priori assigned populations as the basis of our DA? Simulation stuides (eg miller et al 2020) suggests that at low fst, the ability of k means clustering to succesfully recapitulate real genetic clusters is low, however, DA on biologically inaccurate grouping variables is not meaningful… Let’s do both for now as they are both informative in different ways.
#first optimize the PCs retained based on the a.score, so run dapc on the full pcs
#invisible(dapc_full_denovo <- dapc(genind_2.1, clusts$grp, n.pca = 330, n.da = 3 ))
#optim.a.score(dapc_full_denovo)
#nice now we use the optimized a score to run our dapc for real
dapc_full_denovo <- dapc(genind_2.3, clusts$grp, n.pca = 57, n.da = 3 )
plot_data <- as.data.frame(cbind(dapc_full_denovo$ind.coord, genind_2.3$pop, clusts$grp))
colnames(plot_data) <- c("LD1", "LD2", "LD3", "pop", "grp")
plot_data$pop <- as.factor(plot_data$pop)
plot_data$grp <- as.factor(plot_data$grp)
ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color=pop))+scale_color_viridis_d(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+theme_classic()+ggtitle("DAPC by k-means cluster, color by a priori population")+guides(color=guide_legend("Run-Type"))
ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color=grp))+scale_color_viridis_d()+theme_classic()+ggtitle("DAPC by cluster, colored by k-means cluster")+guides(color=guide_legend("K-means cluster"))
marker_loadings1 <- loadingplot(dapc_full_denovo$var.contr, axis=1,thres=.005, lab.jitter=1, main = "DA axis 1 loading plot")
markers1 <- unique(substr(names(marker_loadings1$var.values),1,nchar(names(marker_loadings1$var.values))-2))
marker_loadings2 <- loadingplot(dapc_full_denovo$var.contr, axis=2,thres=.02, lab.jitter=1, main = "DA axis 2 loading plot")
markers2 <- unique(substr(names(marker_loadings2$var.values),1,nchar(names(marker_loadings2$var.values))-2))
kable(marker_mapping2 %>%
filter(marker %in% markers1 ) %>%
select(marker, `Presumed Type`), caption = "markers heavily loaded into first discriminant axis")
| marker | Presumed Type |
|---|---|
| Chr28_11667578 | Adaptive. Run Timing |
| OmyR14589 | Adaptive. Residency vs anadromy |
| Omy_bcAKala-380rd | Neutral |
| OmyR24370 | Adaptive. Residency vs anadromy |
| OmyR40252 | Adaptive. Residency vs anadromy |
| OmyR19198 | Adaptive. Residency vs anadromy |
| OmyR33562 | Adaptive. Residency vs anadromy |
| Chr28_11607954 | Adaptive. Run Timing |
| Omy_GREB1_05 | Adaptive. Run timing |
| Chr28_11625241 | Adaptive. Run Timing |
| Chr28_11658853 | Adaptive. Run Timing |
| Omy_RAD15709-53 | Adaptive. Natal site Isothermality. Basin-wide, run-time - related |
| Chr28_11676622 | Adaptive. Run Timing |
| Chr28_11683204 | Adaptive. Run Timing |
| Chr28_11702210 | Adaptive. Run Timing |
kable(marker_mapping2 %>%
filter(marker %in% markers2 ) %>%
select(marker, `Presumed Type`), caption = "markers heavily loaded into second discriminant axis")
| marker | Presumed Type |
|---|---|
| OmyR14589 | Adaptive. Residency vs anadromy |
| Omy_bcAKala-380rd | Neutral |
| OmyR24370 | Adaptive. Residency vs anadromy |
| OmyR40252 | Adaptive. Residency vs anadromy |
| OmyR19198 | Adaptive. Residency vs anadromy |
| OmyR33562 | Adaptive. Residency vs anadromy |
As suspected, given the the fact that fall and halfpounder fish do not fall into a single genetic cluster, de novo assigned genetic cluters do not do a great job of differentiating among a priori assigned run-timing groups. When breaking the fish into four groups these groups are mostly defined by covariation at ~15 markers including run timing markers, a neutral marker (linked to residency markers), and residency markers. Note that the loadings for the run-timing markers are ~2 fold greater than other markers included in the table above and that this cutoff is arbitrary.
A note on the best number of pcs
Given that (a) we are using a priori assigned groups to assess differences among said groups and (b) the number of predictors is only slightly less than the number of object (n = 882 and p= ~350), we have to be really cautious about overfitting here, so we’re going to run both cross-fold validation and the “a.score” procedure to determine the correct number of pcs to retain, then we will retain pcs according to which optimization strategy suggests the lower number of pcs. One issue here is that previous results suggest that at least one and potentially two (fall and half pounders) of our a priori groups is a synthetic (i.e. non exclusively genetic) assemblage of individuals. If this is the case, even using approaches such as the a score or cross validation will lead to overfitting w respect to the underlying biology and our choice of n.pc should attempt to limit overfitting by choosing the minimum PCs possible that still strongly discriminate.
These are the results from the original run (with two additional run-timing markers):
The best pca according to optim.a.score was 112, but there was only extremely slight improvement at each pc beyond 12, where we observed a local maximum. Best pca according to cross validation using overall group assignment success was 16, however cross validation using mean group assignment per group was 121. Similar to the a.score approach, there is not a strong “arc” pattern in the latter as one might suspect for a cross validation procedure, instead fit improves rapidly after the first few pcs, declines, then and continues a gradual improvement. WE considered this to be a potential example of double descent, as once the number of PCs exceeds the group sample size, the model is overparametized (p > n). We chose the number of PCs that correspond to this first peak in both the xval and optim.a.score procedure: 12. However, it should be noted that running DAPC with widely varying PCs did not qualitatively change the pattern of results. The degree of discrimination increased with increasing PCs and loadings on individual markers varied, but not the qualitative relationships among a priori assigned populations.
Final results:
When we removed the two troublesome markers, the a.score results were nearly exactly the same. Best pca was 113, but there was only extremely slight improvement at each pc beyond 12, where we observed a local maximum. Best pca according to cross validation using overall group assignment success was 12, however cross validation using mean group assignment per group was 122. Similar to the a.score approach, there is not a strong “arc” pattern in the latter as one might suspect for a cross validation procedure, instead fit improves rapidly after the first few pcs, declines, then and continues a gradual improvement. WE considered this to be a potential example of double descent, as once the number of PCs exceeds the group sample size, the model is overparametized (p > n) and this is precisely where the increase in xval success rate began to increase. We chose the number of PCs that correspond to this first peak in both the xval and optim.a.score procedure: 12. However, it should be noted that running DAPC with widely varying PCs did not qualitatively change the pattern of results. The degree of discrimination increased with increasing PCs and loadings on individual markers varied, but not the qualitative relationships among a priori assigned populations.
Now the DAPC
#first optimize the PCs retained based on the a.score, so run dapc on the full pcs
#invisible(dapc_full_apriori <- dapc(genind_2.3, n.pca = 330, n.da = 3 ))
#to speed this up in future knits, we ran optim.a.score once, but not in the notebook
#optim.a.score(dapc_full_apriori, smart=FALSE)
#nice now we use the optimized a score to run our dapc for real
#mat <- as.matrix(scaleGen(genind_2.3, NA.method="mean", scale=FALSE, center=FALSE))
#xpop <- pop(genind_2.3)
#xval <- xvalDapc(mat, xpop, n.pca.max = 200, training.set = 0.9, result = "overall", center = TRUE, scale = FALSE, n.pca = seq(1,300, length.out = 60), n.rep = 50, xval.plot = TRUE)
invisible(dapc_full_apriori <- dapc(genind_2.3, n.pca = 12, n.da = 3 ))
plot_data <- as.data.frame(dapc_full_apriori$ind.coord)
plot_data$pop <- as.character(genind_2.3$pop)
ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color = pop), alpha = 0.5, size = 2)+theme_classic()+scale_color_viridis_d()+stat_ellipse(aes(x=LD1, y=LD2, color = pop))
scatter(dapc_full_apriori)
marker_loadings1 <- loadingplot(dapc_full_apriori$var.contr, axis=1,thres=.01, lab.jitter=1, main = "laoding plot for DA 1")
markers1 <- unique(substr(names(marker_loadings1$var.values),1,nchar(names(marker_loadings1$var.values))-2))
marker_loadings2 <- loadingplot(dapc_full_apriori$var.contr, axis=2,thres=.006, lab.jitter=1, main = "laoding plot for DA 2")
markers2 <- unique(substr(names(marker_loadings2$var.values),1,nchar(names(marker_loadings2$var.values))-2))
kable(marker_mapping2 %>%
filter(marker %in% markers1 ) %>%
select(marker, `Presumed Type`), caption = "Markers that load strongly onto first DA")
| marker | Presumed Type |
|---|---|
| Chr28_11667578 | Adaptive. Run Timing |
| Omy_128996-481 | Neutral |
| Omy_GREB1_05 | Adaptive. Run timing |
| Chr28_11625241 | Adaptive. Run Timing |
| Chr28_11658853 | Adaptive. Run Timing |
| Omy_RAD15709-53 | Adaptive. Natal site Isothermality. Basin-wide, run-time - related |
| Chr28_11676622 | Adaptive. Run Timing |
| Chr28_11683204 | Adaptive. Run Timing |
| Chr28_11702210 | Adaptive. Run Timing |
kable(marker_mapping2 %>%
filter(marker %in% markers2 ) %>%
select(marker, `Presumed Type`, chr, start), caption = "Markers that load strongly onto the second DA")
| marker | Presumed Type | chr | start |
|---|---|---|---|
| Omy_112301-202 | Neutral | 3 | 37590514 |
| Omy_RAD31408-67 | Adaptive. Basin-wide, top-outlier; Stacks Locus ID is 31408_14 | 16 | 48340989 |
| Omy_103705-558 | Neutral | 17 | 7065921 |
| OmyR14589 | Adaptive. Residency vs anadromy | 5 | 56162739 |
| Omy_RAD42465-32 | Adaptive. Maxiumum air temperature (warmest quarter). Basin-wide, top-outlier | 18 | 25034263 |
| Omy_RAD60135-12 | Adaptive. Migration Distance to Ocean. Basin-wide, top-outlier | 6 | 56110047 |
| Omy_RAD50632-21 | Adaptive. Basin-wide, top-outlier | 1 | 38986481 |
| Omy_G3PD_2-371 | Neutral | 2 | 6083880 |
| OmyR24370 | Adaptive. Residency vs anadromy | 5 | 28579317 |
| OmyR40252 | Adaptive. Residency vs anadromy | 5 | 31675237 |
| OmyR19198 | Adaptive. Residency vs anadromy | 5 | 34973454 |
| OmyR33562 | Adaptive. Residency vs anadromy | 5 | 47337494 |
| Omy_SECC22b-88 | Neutral. Possible linkage | 5 | 61828845 |
| Omy_vamp5-303 | Neutral | 6 | 33625089 |
| Omy_RAD76570-62 | Adaptive. Minimum annual precipitation. Basin-wide, precipitation-related; | 12 | 56297712 |
| Omy_101993-189 | Neutral | 17 | 21491237 |
| Omy_RAD43612-42 | Neutral | 18 | 29118747 |
| Omy_LDHB-2_i6 | Neutral | 21 | 24130489 |
| Chr28_11683204 | Adaptive. Run Timing | 28 | 11683151 |
| Chr28_11702210 | Adaptive. Run Timing | 28 | 11702168 |
# for ms figures
# ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color = pop), alpha = 0.5, size = 2)+theme_classic()+scale_color_viridis_d(name = "Run",labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter" ))+stat_ellipse(aes(x=LD1, y=LD2, color = pop))
#for ms tables with loading value
#marker_loadings2a <- loadingplot(dapc_full_apriori$var.contr, axis=2,thres=.00000001, lab.jitter=1, main = "laoding plot for DA 2")
#write.table(as.data.frame(marker_loadings2a$var.values), "load.txt", quote = FALSE, sep = "\t" )
#write.table(select(marker_mapping2, marker, `Presumed Type`, chr), "map.txt", quote=FALSE, sep = "\t")
DAPC largely discriminates among groups along the first discriminant axis. Fall and half pounder fish demonstrate little to no differentiation along any DA. The first DA strongly separates winter and summer run fish, and is driven by by 9 markers including 8 run timing associated markers and a neutral marker (Omy_128996-481). The second da captures much less variation among groups (79% vs 16%), and is driven largely by 5 residency markers and 17 additional markers including neutral markers and several associated with residency/anadromy and other adapative traits. Interstingly one of the “neutral” annotated markers (Omy_LDHB-2…) have been associated with residency vs anadromy in an association study in the klickitat river (doi.org/10.1080/00028487.2011.588131).
Next we use this a priori DAPC to attempt to classify half pounders into early-summer-like, winter-like or intermediate groups. Note: did this two ways, in the first we run a true classfiication scheme: exclude the fall and half pounder fish, run create a classifier then predict the fall and half pounder membership based on this classifier, however if fall run fish are a real group, the classifier should include all three or four groups and therefore would default back to the original DAPC. I left the true classifier (just winter vs summer) in the notebook, but I think we should just present the full a priori DAPC results. In any case, the effect on the end result is marginal
classifier <- dapc(genind_2.1[pop = c("Winter", "Summer")], n.da = 2, n.pca = 1) #optim a score is 1
pred.half<- predict.dapc(classifier, newdata=genind_2.1[pop=c("halfpounder" , "fall")])
preds <- as.data.frame(cbind(pred.half$posterior, pred.half$ind.scores))
#hmm pop info is not moving over easilt left just merge it from another df
colnames(fstat)[1] <- "pop"
pops_info <- fstat %>%
rownames_to_column(var="sample") %>%
select(sample, pop)
preds <- preds %>%
rownames_to_column(var="sample") %>%
left_join(pops_info)
ld1 <- as.data.frame(classifier$ind.coord) %>%
bind_rows(as.data.frame(pred.half$ind.scores)) %>%
rownames_to_column(var="sample") %>%
left_join(pops_info)
ld1_2 <- as.data.frame(dapc_full_apriori$ind.coord) %>%
rownames_to_column(var="sample") %>%
left_join(pops_info)
#plot
ggplot(data=ld1)+geom_density(aes(LD1, fill=pop, color=pop), alpha = 0.2)+scale_color_viridis_d()+scale_fill_viridis_d()+theme_classic()+ggtitle("True Classifier")
ggplot(data=ld1_2)+geom_density(aes(LD1, fill=pop, color=pop), alpha = 0.2)+scale_color_viridis_d(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+scale_fill_viridis_d(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+theme_classic()+ggtitle("A priori DAPC results")+labs(color = "Run", fill = "Run")
# assignments using classifier
CIs <- ld1 %>%
group_by(pop) %>%
summarise(loCI = quantile(LD1, probs = 0.025),
hiCI = quantile(LD1, probs = 0.975))
#number of half pounders that fall in the 95% credible interval for winter fish assignment
kable(ld1 %>%
filter(pop == "halfpounder") %>%
summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum(( LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "True Classifier - Half Pounders")
| winter_assigned | summer_assigned | unassigned |
|---|---|---|
| 164 | 73 | 406 |
# assignments using original DAPC
CIs <- ld1_2 %>%
group_by(pop) %>%
summarise(loCI = quantile(LD1, probs = 0.025),
hiCI = quantile(LD1, probs = 0.975))
#number of half pounders that fall in the 95% credible interval for winter fish assignment
kable(ld1_2 %>%
filter(pop == "halfpounder") %>%
summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum(( LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "A priori DAPC - Half Pounders")
| winter_assigned | summer_assigned | unassigned |
|---|---|---|
| 350 | 54 | 239 |
#same as above but for fall fish
CIs <- ld1 %>%
group_by(pop) %>%
summarise(loCI = quantile(LD1, probs = 0.025),
hiCI = quantile(LD1, probs = 0.975))
#number of half pounders that fall in the 95% credible interval for winter fish assignment
kable(ld1 %>%
filter(pop == "fall") %>%
summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum((LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "True Classifier - Fall Run")
| winter_assigned | summer_assigned | unassigned |
|---|---|---|
| 18 | 8 | 131 |
# assignments using original DAPC
CIs <- ld1_2 %>%
group_by(pop) %>%
summarise(loCI = quantile(LD1, probs = 0.025),
hiCI = quantile(LD1, probs = 0.975))
#number of half pounders that fall in the 95% credible interval for winter fish assignment
kable(ld1_2 %>%
filter(pop == "fall") %>%
summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum((LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "A priori DAPC - Fall Run")
| winter_assigned | summer_assigned | unassigned |
|---|---|---|
| 74 | 5 | 78 |
When we use the original DAPC (not the classifier, still need to make a decision here) we attempt to classify fish into winter or summer like using the first DA which accounts for ~80% of the among group variance and is driven almost exclusively by known run timing associated markers. We then draw 95% credible intervals for winter and summer run and count how many fall and half pounder fish fall beyond the lower bounds for these intervals (i.e. at least as winter-like or summer-like as the 95% CI). For fall fish, the majority are winter assigned, nearly as many are unassigned and a small portion are summer assignmed. The same pattern is observed among the half pounders, but ratio number of unassigned fish is somewhat lower.
There is evidence of weak differentiation among populations at neutral markers from the estimation of Fst. Here we will attempt to magnify these differences using a discirminat analaysis among a priori assigned run types.
#neutraldapc <- dapc(neutral_genind, n.pca = 226, n.da= 3 )
#optim.a.score(neutraldapc)
#nice now we use the optimized a score to run our dapc for real
invisible(neutraldapc <- dapc(neutral_genind, n.pca = 60, n.da = 3 ))
plot_data <- as.data.frame(neutraldapc$ind.coord)
plot_data$pop <- as.character(genind_2.1$pop)
#scatter.dapc(neutraldapc)
plot_ly(x=plot_data$LD1, y=plot_data$LD2, z=plot_data$LD3, type="scatter3d", mode="markers", color=genind_2.1$pop, alpha = 0.8)
#pub quality figure
ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color = pop), alpha = 0.5, size = 2)+theme_classic()+scale_color_viridis_d(name = "Run",labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter" ))+stat_ellipse(aes(x=LD1, y=LD2, color = pop))
DA axis 1 separates summer from half pounder, fall and winter. DA 2 separates summer from the rest. DA3 also separates summer from the rest. Interestingly this is the same pattern as obsevred using the full dataset, despite the fact that the first DA was driven almost exclusively by run timing markers. This suggests either LD in our dataset between run timing and neutral markers, or that restricted gene flow tied to run timing leads differentiation at neutral markers. In any case it follows with the neutral Fst a that half pounders and fall run fish are more similar to winter run than summer run across the genome. I’ll also add that this pattern looks like overfitting to me.
K means clustering always assigned half pounders and fall run fish to all genetic clusters, while always segregating winter and summer run fish. This suggests there is substantial structure with fall and half pounder fish. Similarly, when we attempt to discriminate among a priori assigned groups, the genetic variation contained within adults that make a fall run, as well as half-pounders is nearly entirely inclusive of winter run fish, and partially inclusive of summer run fish, however, winter and summer run fish are well discriminated.
In other words:
(a) winter and summer run fish can clearly be distinguished from one another using the genetic variation captured by our GTseq panel. Genetic variation among these two groups is driven by markers with known run-timing associations (b) fall run and half pounder fish can not be discriminated from each other using the genetic information we half available
(c) fall run and half pounder fish demonstrate substantial within-“population” structure, and include fish that demonstrate a high degree of genetic similarity with both winter run and summer run fish, hwoever, most individuals demonstrate intermediate genetic charactistics. interestingly,this within population structure is driven neutral markers and residency associated markers
Let’s look specifically at run-timing associated markers
marker_mapping <- readxl::read_xlsx("./metadata/final_mapping_results.xlsx", sheet = 1)
run_timing_loci_names <- marker_mapping %>%
filter(chr == "28" | CRITFC_chromosome == "28") %>%
filter(str_detect(`Presumed Type`, 'run|Run')) %>%
select(marker)
#different naming convention, lets fix
run_timing_loci_names <- str_replace(run_timing_loci_names$marker, "Omy28", "Chr28")
#run_timing_loci <- genos_2.3 %>%
# select(Sample, Date, run, year, one_of(run_timing_loci_names))
#genind_pop <- seppop(genind_2.1)
#run_timing_fall <- genind_pop$fall[loc=run_timing_loci_names]
#run_timing_half <- genind_pop$halfpounder[loc=run_timing_loci_names]
#run_timing_winter <- genind_pop$Winter[loc=run_timing_loci_names]
#run_timing_summer <- genind_pop$Summer[loc=run_timing_loci_names]
Are any markers diagnostic (fixed or nearly fixed within a run-timing group)? Below we plot:
(a) a full heatmap of allele frequencies
(b) heatmap just at near diagnostic markers (0.9 vs 0.1 across winter and summer runs)
(c) a heatmap of all markers with run-timing associations
#because adegenet can be so difficult to use, let take advantage of popgenreport to collect our summary data
all_counts <- allele.dist(genind_2.1, mk.figures = FALSE)$count
#make into a dataframe
all_counts <- as.data.frame(do.call(rbind, all_counts))
colnames(all_counts) <- c("fall_count", "half_count", "summer_count", "winter_count")
all_counts$sum <- rowSums(all_counts, na.rm = TRUE)
all_freqs <- allele.dist(genind_2.1, mk.figures = FALSE)$frequency
#make into a dataframe
all_freqs <- as.data.frame(do.call(rbind, all_freqs))
all_freqs <- as.data.frame(cbind(all_freqs, all_counts))
##### get only minor allele
all_freqs$marker <- genind_2.1$loc.fac
#now group by marker and keep the minor allele, then convert counts to
all_maf <- all_freqs %>%
group_by(marker) %>%
slice_min(sum) %>%
replace(., is.na(.), 0)
#filter all maf to include only diagnostic or near diagnostic (0.95) markers between winter and summer runs
diagmaf <- all_maf %>%
filter((Summer > 0.9 & Winter < 0.1) | (Summer < 0.1 & Winter > 0.9))
# plot big heatmap
tmat <- t(as.matrix(all_maf[,1:4]))
colnames(tmat) <- all_maf$marker
pheatmap(tmat, show_colnames = F)
#plot as a heatmap
col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(diagmaf[,1:4]))
colnames(tmat) <- diagmaf$marker
pheatmap(tmat, cluster_cols = FALSE)
#because adegenet can be so difficult to use, let take advantage of popgenreport to collect our summary data
run_timing_counts <- allele.dist(genind_2.1[loc=run_timing_loci_names], mk.figures = FALSE)$count
#make into a dataframe
run_timing_counts <- as.data.frame(do.call(rbind, run_timing_counts))
colnames(run_timing_counts) <- c("fall_count", "half_count", "summer_count", "winter_count")
run_timing_counts$sum <- rowSums(run_timing_counts, na.rm = TRUE)
run_timing_freqs <- allele.dist(genind_2.1[loc=run_timing_loci_names], mk.figures = FALSE)$frequency
#make into a dataframe
run_timing_freqs <- as.data.frame(do.call(rbind, run_timing_freqs))
run_timing_freqs <- as.data.frame(cbind(run_timing_freqs, run_timing_counts))
##### get only minor allele
#then put the marker names in the same order as the allele.dist (exlude the missing marker first)
run_timing_loci_names <- run_timing_loci_names[run_timing_loci_names != "Omy_RAD52458-17"]
run_timing_freqs$marker <- rep(run_timing_loci_names[order(run_timing_loci_names[-c(8,10)])], each = 2)
#now group by marker and keep the minor allele, then convert counts to
run_timing_maf <- run_timing_freqs %>%
group_by(marker) %>%
slice_min(sum) %>%
replace(., is.na(.), 0)
#plot as a heatmap
col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(run_timing_maf[,1:4]))
colnames(tmat) <- run_timing_maf$marker
pheatmap(tmat, cluster_cols = FALSE)
#repolarize greb05 and 7080-54
repol <- run_timing_maf %>%
mutate(fall = ifelse(marker == "Omy_GREB1_05", (1 - fall), fall)) %>%
mutate(halfpounder = ifelse(marker == "Omy_GREB1_05", (1 - halfpounder), halfpounder)) %>%
mutate(Summer = ifelse(marker == "Omy_GREB1_05", (1 - Summer), Summer)) %>%
mutate(Winter = ifelse(marker == "Omy_GREB1_05", (1 - Winter), Winter)) #%>%
# mutate(fall = ifelse(marker == "Omy_RAD47080-54", (1 - fall), fall)) %>%
# mutate(halfpounder = ifelse(marker == "Omy_RAD47080-54", (1 - halfpounder), halfpounder)) %>%
# mutate(Summer = ifelse(marker == "Omy_RAD47080-54", (1 - Summer), Summer)) %>%
# mutate(Winter = ifelse(marker == "Omy_RAD47080-54", (1 - Winter), Winter))
repol <- repol %>%
select(fall, halfpounder, Winter, Summer, marker) %>%
rename("Late-Summer" = fall, "Half-Pounder" = halfpounder, "Early-Summer" = Summer)
col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(repol[,1:4]))
colnames(tmat) <- repol$marker
pheatmap(tmat)
Of the 12 markers with known run-timing associations, 7 are fixed for alternative alleles in winter and summer run fish. One additional marker (greb1_05) is highly informative, but is not fixed in winter run fish. Fall run fish and half pounder demonstrate intermediate genotypes and very little differentiation from one another.
Sometimes it can be helpful to look at the pattern across individuals and along the genome rather than allele frequency. Below we make a graphical representation of some of the genotypes at run timing associated markers as well.
#first we need to get a dataset that will work
#the tab slot of the adegenet object will work if we polarize the data by selecting the allele with the higher count in either winter or early summer
#for each pair of columns, keep the one with the highest average, few enough markers, we can just choose these manually
colSums(chr28_seppop$Winter$tab, na.rm = TRUE)
## Chr28_11607954.G Chr28_11607954.A Chr28_11625241.A Chr28_11625241.G
## 62 18 0 80
## Chr28_11632591.G Chr28_11632591.A Chr28_11658853.A Chr28_11658853.C
## 63 17 0 80
## Chr28_11667578.T Chr28_11667578.C Chr28_11676622.T Chr28_11676622.G
## 0 80 0 80
## Chr28_11683204.G Chr28_11683204.T Chr28_11702210.T Chr28_11702210.G
## 0 80 0 80
## Chr28_11773194.A Chr28_11773194.T Omy_GREB1_05.T Omy_GREB1_05.G
## 72 8 14 66
## Omy_GREB1_09.G Omy_GREB1_09.T Omy_RAD15709-53.G Omy_RAD15709-53.A
## 80 0 80 0
#bind these together
polarized_allele_counts <- as.data.frame(chr28_seppop$Winter$tab[,c(1,4,5,8,10,12,14,16,17,20,21,23)]) %>%
bind_rows(as.data.frame(chr28_seppop$Summer$tab[,c(1,4,5,8,10,12,14,16,17,20,21,23)])) %>%
bind_rows(as.data.frame(chr28_seppop$fall$tab[,c(1,4,5,8,10,12,14,16,17,20,21,23)])) %>%
bind_rows(as.data.frame(chr28_seppop$halfpounder$tab[,c(1,4,5,8,10,12,14,16,17,20,21,23)]))
#plot heatmap
#first, order the markers according to genomic position
map_results <- readxl::read_xlsx("metadata/final_mapping_results.xlsx", sheet = 1)
map_results <- map_results %>%
mutate(marker = str_replace(marker, "Omy28", "Chr28"))
colnames(polarized_allele_counts) <- substr(colnames(polarized_allele_counts), 0, nchar(colnames(polarized_allele_counts))-2)
marker_pos <- map_results %>%
select(marker, CRITFC_SNP_pos_genome) %>%
filter(marker %in% colnames(polarized_allele_counts))
#hmm one marker position is missing, just add it manually
marker_pos$CRITFC_SNP_pos_genome <- as.numeric(marker_pos$CRITFC_SNP_pos_genome)
## Warning: NAs introduced by coercion
marker_pos[12,2] <- 11702210
#order to columns
polarized_allele_counts <- polarized_allele_counts %>%
select(unlist(c(marker_pos[order(marker_pos$CRITFC_SNP_pos_genome),1]), use.names = FALSE))
#add pop info
polarized_allele_counts$pop <- c(rep("winter", nrow(chr28_seppop$Winter$tab)), rep("summer", nrow(chr28_seppop$Summer$tab)), rep("fall", nrow(chr28_seppop$fall$tab)), rep("halfpounder", nrow(chr28_seppop$halfpounder$tab)))
#split into groups for easy visualization
winter_pac <- filter(polarized_allele_counts, pop == "winter")
summer_pac <- filter(polarized_allele_counts, pop == "summer")
fall_pac <- filter(polarized_allele_counts, pop == "fall")
halfpounder_pac <- filter(polarized_allele_counts, pop == "halfpounder")
a = pheatmap(winter_pac[,-13], cluster_cols = FALSE, show_rownames = FALSE, main = "winter")
b = pheatmap(summer_pac[,-13], cluster_cols = FALSE, show_rownames = FALSE, main = "summer")
c = pheatmap(fall_pac[,-13], cluster_cols = FALSE, show_rownames = FALSE, main = "fall")
d = pheatmap(halfpounder_pac[,-13], cluster_cols = FALSE, show_rownames = FALSE, main = "halfpounder")
#same as above but for diag only
fall_pac_diag <- select(fall_pac, -c("Chr28_11607954", "Omy_GREB1_05", "Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194"))
#repolarize the -54 marker
#fall_pac_diag <- fall_pac_diag %>%
# mutate(`Omy_RAD47080-54` = 2 - `Omy_RAD47080-54`)
c = pheatmap(fall_pac_diag[,-8], cluster_cols = FALSE, show_rownames = FALSE, main = "fall")
half_pac_diag <- select(halfpounder_pac, -c("Chr28_11607954", "Omy_GREB1_05", "Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194"))
#repolarize the -54 marker
#fall_pac_diag <- fall_pac_diag %>%
# mutate(`Omy_RAD47080-54` = 2 - `Omy_RAD47080-54`)
c = pheatmap(half_pac_diag[,-8], cluster_cols = FALSE, show_rownames = FALSE, main = "halfpounder")
# same as above but excluding markers with weird missingness patterns and getting rid of the duplicate (-54)
#fall_pac_diag_mo_miss <- select(fall_pac_diag, -c("Omy_RAD47080-54", "Chr28_11671116"))
#counting all winter fall run
#sum(rowSums(fall_pac_diag_mo_miss[,-8], na.rm = TRUE) ==14)
#counting all summer fall run
#sum(rowSums(fall_pac_diag_mo_miss[,-8], na.rm = TRUE) == 0)
#same for half
#half_pac_diag_mo_miss <- select(half_pac_diag, -c("Omy_RAD47080-54", "Chr28_11671116"))
#counting all winter fall run
#sum(rowSums(half_pac_diag_mo_miss[,-8], na.rm = TRUE) ==14)
#counting all summer fall run
#sum(rowSums(half_pac_diag_mo_miss[,-8], na.rm = TRUE) == 0)
#again but split by year
#half_pac_diag_mo_miss_2018 <- half_pac_diag_mo_miss %>%
# rownames_to_column(var = "Sample") %>%
# left_join(select(genos_2.3, Sample, year)) %>%
# filter(year == "2018")
#half_pac_diag_mo_miss_2019 <- half_pac_diag_mo_miss %>%
# rownames_to_column(var = "Sample") %>%
# left_join(select(genos_2.3, Sample, year)) %>%
# filter(year == "2019")
#here we count the number of samples that uncontroversially fall into summer and winter bins (no missing data, only winter/sumer alleles)
#sum(rowSums(half_pac_diag_mo_miss_2018[,-c(1,9,10)], na.rm = TRUE) ==14)
#sum(rowSums(half_pac_diag_mo_miss_2018[,-c(1,9,10)], na.rm = TRUE) ==0)
#sum(rowSums(half_pac_diag_mo_miss_2019[,-c(1,9,10)], na.rm = TRUE) ==14)
#sum(rowSums(half_pac_diag_mo_miss_2019[,-c(1,9,10)], na.rm = TRUE) ==0)
The second DA in the a priori DAPC captures less among group variation than the first, however, it is oriented along a large axis of genetic variation within both the fall run and halfpounders. Part of this axis is driven by snps with known association with residency vs anadromy. Let’s explore this briefly:
First let’s take a look at the allele frequencies at residency vs anadromy alleles on chromosome 5
residency_loci_names <- marker_mapping %>%
filter(chr == "5" | CRITFC_chromosome == "5") %>%
filter(str_detect(`Presumed Type`, 'residency|Residency')) %>%
select(marker)
residency_loci_names <- residency_loci_names$marker
#one of these is not in e final dataset, remove
residency_loci_names <- residency_loci_names[-6]
# get summary info
residency_counts <- allele.dist(genind_2.1[loc=residency_loci_names], mk.figures = FALSE)$count
#make into a dataframe
residency_counts <- as.data.frame(do.call(rbind, residency_counts))
colnames(residency_counts) <- c("fall_count", "half_count", "summer_count", "winter_count")
residency_counts$sum <- rowSums(residency_counts, na.rm = TRUE)
residency_freqs <- allele.dist(genind_2.1[loc=residency_loci_names], mk.figures = FALSE)$frequency
#make into a dataframe
residency_freqs <- as.data.frame(do.call(rbind, residency_freqs))
residency_freqs <- as.data.frame(cbind(residency_freqs, residency_counts))
##### get only minor allele
#then put the marker names in the same order as the allele.dist (exlude the missing marker first)
residency_freqs$marker <- rep(residency_loci_names[order(residency_loci_names)], each = 2)
# the residency markers were switched on the probe seqs file during genotype calling, let's fix this here, since none of the published results share marker names up to this point, let's leave the analysis as is and fix at this stage.
# here the info from the corrections
"
R40252 and R40319 are correct.
What I labeled R19198 is actually R14589.
What I labeled R14589 is actually R33562.
What I labeled R33562 is actually R24370.
What I labeled R24370 is actually R19198.
"
## [1] "\nR40252 and R40319 are correct.\nWhat I labeled R19198 is actually R14589.\nWhat I labeled R14589 is actually R33562.\nWhat I labeled R33562 is actually R24370.\nWhat I labeled R24370 is actually R19198.\n"
#and here we rename
residency_freqs$marker[c(3,4)] <- "OmyR14589"
residency_freqs$marker[c(1,2)] <- "OmyR33562"
residency_freqs$marker[c(7,8)] <- "OmyR24370"
residency_freqs$marker[c(5,6)] <- "OmyR19198"
#now group by marker and keep the minor allele, then convert counts to
residency_maf <- residency_freqs %>%
group_by(marker) %>%
slice_min(sum) %>%
replace(., is.na(.), 0)
#plot as a heatmap
residency_maf <- residency_maf %>%
rename("Half-pounder" = "halfpounder", "Early-Summer" = "Summer", "Late-Summer" = "fall")
col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(residency_maf[,1:4]))
colnames(tmat) <- residency_maf$marker
pheatmap(tmat)
Fall and winter fish vary (somewhat: p = 0.3 vs 0.6) at these residency alleles, with summer and halfpounder intermediate.
Let’s look at allele frequency variance at these alleles as well:
res_divs <- filter(marker_divs, marker %in% residency_loci_names)
res_divs <- pivot_wider(res_divs, names_from = obs_exp, values_from = H)
ggplot(data=res_divs)+geom_point(aes(He, Ho, color = run), size = 3, alpha = 0.8)+scale_color_viridis_d()+theme_classic()+geom_abline(aes(slope =1, intercept = 0))+xlim(0,0.5)+ylim(0,0.5)
We also know what alleles are the resident alleles. Let’s repolarize the figure above so that higher freq indicates more resident allele dosage
#we'll go row by row in the resident freqs dataframe and choose the resident allele
pol_res_freq <- residency_freqs[c(1,4,6,8,10),]
#plot as a heatmap
pol_res_freq2 <- pol_res_freq %>%
rename("Half-pounder" = "halfpounder", "Early-Summer" = "Summer", "Late-Summer" = "fall")
col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(pol_res_freq2[,1:4]))
colnames(tmat) <- pol_res_freq$marker
pheatmap(tmat)
So variation is high within all the runs (least so at fall run), with expected heterozygosity all the way near 0.5. Interestingly there is a dearth of heterozygotes in all but summer run, however, this is significant only for halfpounder where it is the strongest. This pattern suggests that there is some structuring with the halfpounders and to a lesser extent fall run (but not summer and winter run fish), with reduced interbreeding between individuals expressing alternative residency alleles. interpreting the differences across populations suggests there’s varying degrees of introgression between anadromous and resident fish across the run types, with the highest proportion of resident alleles in the late-summer run samples. this may be explained by the temporal and spatial variation in spawning between run types.
also something to think about here is that there is an observed tendency of hatchery released smolts to residualize over the summer, winter fish are graded (i.e. precocially mature males are not released) so it is likely to be the summer hatchery fish contributing the residuals.
also of note here is that our winter sample includes both hatchery and natural origin fihs, while our other samples contain only natural origin, so the pattern here could be due to hatchery alleles.
Quick LD plot using r(bar)D of winter and summer runs (first plot) vs fall and half pounder (second) plot. Will come back and clean this up later. Absence of variation within some runs is causing code to break for plotting and estimating LD in a lot of packages. If we want formal LD, I think I’ll need to manually convert to a plink or vcf format and estimate D or r2 with plink/tassel or other .
chr28_genind <- genind_2.1[loc = run_timing_loci_names]
# first lets calculate LD (dartR has a great (fast) ld estimator that works right on genind files, so let's use this)
#ldreport_winter <- dartR::gl.report.ld(chr28_genind[pop = c("Summer", "Winter")], name = NULL, save = FALSE )
#ldreport_fall <- dartR::gl.report.ld(chr28_genind[pop = c("fall", "halfpounder")], name = NULL, save = FALSE )
#this command is breaking when running on the subset datasets (no variance to do the calcs on), lets settle on poppr's tools for now (eventually will need a better LD estimate)
invisible(ldr <- poppr::pair.ia(chr28_genind[pop = c("Summer", "Winter")], limits = c(-0.1, 1.1)))
invisible(ldr_f <- poppr::pair.ia(chr28_genind[pop = c("fall", "halfpounder")], limits = c(-0.1, 1.1)))
# reorder winter / summer
ldr_f <- rownames_to_column(as.data.frame(ldr_f), var = "rowid")
ldr_f <- ldr_f %>%
separate(rowid, into = c("snp1", "snp2"), sep = ":")
ldr_f$snp1 <- factor(ldr_f$snp1, levels = marker_pos$marker[order(marker_pos$CRITFC_SNP_pos_genome)], ordered = TRUE)
ldr_f$snp2 <- factor(ldr_f$snp2, levels = marker_pos$marker[order(marker_pos$CRITFC_SNP_pos_genome)], ordered = TRUE)
ldr_f_opposite_tri <- ldr_f
colnames(ldr_f_opposite_tri) <- c("snp2", "snp1", "rbarD")
ldr_f <- ldr_f %>%
bind_rows(ldr_f_opposite_tri)
ggplot(ldr_f)+geom_tile(aes(snp1, snp2, fill=rbarD))+theme_classic()+theme(axis.text.x = element_text(angle = 90))+scale_fill_gradient(low = "#bdc3c7", high= "#2c3e50")
# reorder fall_halfpounder
ldr <- rownames_to_column(as.data.frame(ldr), var = "rowid")
ldr <- ldr %>%
separate(rowid, into = c("snp1", "snp2"), sep = ":")
ldr$snp1 <- factor(ldr$snp1, levels = marker_pos$marker[order(marker_pos$CRITFC_SNP_pos_genome)], ordered = TRUE)
ldr$snp2 <- factor(ldr$snp2, levels = marker_pos$marker[order(marker_pos$CRITFC_SNP_pos_genome)], ordered = TRUE)
ldr_opposite_tri <- ldr
colnames(ldr_opposite_tri) <- c("snp2", "snp1", "rbarD")
ldr <- ldr %>%
bind_rows(ldr_opposite_tri)
ggplot(ldr)+geom_tile(aes(snp1, snp2, fill=rbarD))+theme_classic()+theme(axis.text.x = element_text(angle = 90))+scale_fill_gradient(low = "#bdc3c7", high= "#2c3e50")
now let’s do the same but skip the uninformative markers for better visualization
uninf_markers <- c("Chr28_11607954","Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194")
ldr <- ldr %>%
filter(!(snp1 %in% uninf_markers)) %>%
filter(!(snp2 %in% uninf_markers))
ggplot(ldr)+geom_tile(aes(snp1, snp2, fill=rbarD))+theme_classic()+theme(axis.text.x = element_text(angle = 90))+scale_fill_viridis_c( limits = c(0,1))
ldr_f <- ldr_f %>%
filter(!(snp1 %in% uninf_markers)) %>%
filter(!(snp2 %in% uninf_markers))
ggplot(ldr_f)+geom_tile(aes(snp1, snp2, fill=rbarD))+theme_classic()+theme(axis.text.x = element_text(angle = 90))+scale_fill_viridis_c(limits = c(0,1))
Yes, LD is weaker among fall and half pounder than winter/summer.
LD is also about equally strong over the entire length of the region, there are no clear ld blocks within the region.
The identification of “discordant” genotypes at markers in the region on chromosome 28 associated with run timing (i.e. winter and summer run fish demonstrate near perfect linkage among markers in this region, but fall and halfpounders demonstrate reduced LD), suggests that recombination can occur within this region. If we can phase our data we should be able to identify chr28 haplotypes to gain more inference into the evolutionary origin/maintenance of fall/halfpounders.
Outline:
- identify snps in linkage around chr28 and chr05
- phase genotypes for these snps
- cluster phased genos to identify haplogroups and/or build haplotype network to visualize relationships and visually identify clusters
First we need to get phased genotypic data in order to identify haplotypes.
We will use fastPHASE. First lets get data in the fastphase format
#when I reran the analysis after removing the redudant marker and the marker with population specific missingness, instead of making these input files from scratch I simply deleted the rows from the input file for fastphase.
# the fast phase format's genotype data for an individual is expressed on two rows with each row representing an unphased haploid genotype. There is also some metadata added into header rows. the main gt matrix is very similar to the structure format so we'll use the same approach to reformat the data as we used to convert from genind to structure
chr28_genind <- genind_2.1[loc=run_timing_loci_names]
#note just sort of crashed through this with a text editor, not easily logged, but the general idea was transpose the data, split columns (diploid to dual haploid by adding a tab between the values), then convert N to ?, then read the result into R and transpose again
df <- genind2df(chr28_genind)
#reorder to chromosomal order here
#first, order the markers according to genomic position
map_results <- readxl::read_xlsx("metadata/final_mapping_results.xlsx", sheet = 1)
map_results <- map_results %>%
mutate(marker = str_replace(marker, "Omy28", "Chr28"))
marker_pos <- map_results %>%
select(marker, CRITFC_SNP_pos_genome) %>%
filter(marker %in% colnames(df))
#hmm one marker position is missing, just add it manually
marker_pos$CRITFC_SNP_pos_genome <- as.numeric(marker_pos$CRITFC_SNP_pos_genome)
marker_pos[14,2] <- 11702210
#order to columns
df <- df %>%
select(unlist(c(marker_pos[order(marker_pos$CRITFC_SNP_pos_genome),1]), use.names = FALSE))
# now transponse and save
df <- as.data.frame(t(df))
write_tsv(df, "genotype_data/chr28.tmp")
#in text editor, convert NA to ??, then split columns
df <- read_tsv("genotype_data/chr28.tmp", col_names = FALSE)
df <- t(df)
write_tsv(as.data.frame(df), "genotype_data/chr28.fastphase", col_names = FALSE)
#next we remove whitespaces from the genos and label the individuals like so:
# #individual 1 name
# TAGCG...
# TAGCC...
# #individual 2 name
# TAGCG...
# TAGCC...
# this was accomplished by removing all the whitespace in the genos then using the following regex in find and replace
# find: ^(\w+)\t([A-Z?]+)\n^(\w+)\t([A-Z?]+)
# replace: # \1\n\2\n\4
#then we added the header info according to the fastphase manual
# snp pos was taken from the critfc mapping data and manually added to the file
fastphase was run using default setting and attempting to minimize the switch error (Add more details on this for eventual methods)
~/Science/programs/fastPHASE ../genotype_data/chr28.fastphase
Here we will examine the diversity and putative evolutionary history of the haplotypes.
ph_geno <- read_tsv("phasing/phased_genos4_r.txt")
First lets collect some summary info:
ph_geno <- genos_2.3 %>%
select(Sample, run) %>%
left_join(ph_geno, by = c("Sample" = "ind"))
ph_geno %>%
group_by(run) %>%
summarise(count = n_distinct(hap))
## `summarise()` ungrouping output (override with `.groups` argument)
#write_tsv(ph_geno, "phasing/phased_genos_run.txt")
There are 2 unique haplotypes in summer sample, 9 in winter, 44 in fall and 73 in halfpounders. There are 84 total unique haplotypes.
We constructed a haplotype network using the minimum spanning approach in Popart v1.7 (also tried to use pegas, but data import and conversion kept unphasing the data or and scrambling the sample ids across the different haplotypes)
Below is the minimum spanning network of the 84 different haplotypes among greb1l region SNPS inferred using fastphase and popart (msn). Log in code chunk below
knitr::include_graphics('phasing/phased_genos2.phylip.svg')
# log
# this network was produced outside of r using popart 1.7. briefly, raw phased genotypes were converted to a phylip format (./phasing/phased_genos2.phylip) and run type was formatted as the popart trait data format (see ./phasing/traits2.txt), then we imported both, set the colors according to the viridis pallete used throughout the notebook, and ran the minimum spanning network algorithm to infer a graph/network
### DID NOT USE THIS ###
# didn't use any of this because row ids and haplotype assignments keep getting scrambled and producing misleading results
phased_loci <- read.loci("phasing/phased_genos_run.txt", header = FALSE, col.pop = 1, col.loci = 2:15)
haps <- haplotype(phased_loci, check.phase = TRUE, )
net <- haploNet(haps)
# structure -> df -> haplotype
phased_loci <- read.structure("phasing/phased_genos_run.stru", n.ind = 882, n.loc = 14, onerowperind = FALSE, col.pop = 1, col.lab = 0, col.others = 0, row.marknames = 0)
#phased_loci <- alleles2loci(phased_loci$tab, phased = TRUE)
df <- genind2df(phased_loci)
h <- haplotype(as.matrix(df))
hn <- haploNet(h)
plot(hn, fast = FALSE)
#seems like phasing is getting lost in here somewhere because there are only 88 unique haps in the raw data
# this approach (below) works, but it keeps the haplotypes as characters making the distance calculation inappropriate
phased_loci <- read_tsv("phasing/phased_genos_run.txt")
hap <- haplotype(as.matrix(phased_loci[,-1]))
hap <- sort(hap, what = "labels")
dst <- dist.dna(hap)
dst2 <- dist
tree <- rmst(dst)
ind.hap<-with(
stack(setNames(attr(hap, "index"), row.names(hap))),
table(hap=ind, pop=phased_loci$pop)
)
plot(tree, size=attr(tree, "weight"), scale.ratio = 10, labels= FALSE)
plot(tree, size=attr(tree, "weight"), scale.ratio = 100, cex = 0.8, pie=ind.hap, labels = F)
legend(50,50,colnames(ind.hap), col=rainbow(ncol(ind.hap)), pch=20)
#
hn2 <- haploNet(hap)
countHap <- function(hapl = hap, dna = phased_loci){
with(
stack(setNames(attr(hapl, "index"), rownames(hapl))),
table(hapl = ind, pop = dna$pop)
)
}
countHap()
plot(hn2, pie = countHap(), size = attr(hn2, "freq"), labels= F, scale.ratio = 100 )
legend(50,50,colnames(countHap()), col=rainbow(ncol(countHap())), pch=20)
Here we make a couple rough plots to see if the haplotypes make sense.
phased_genos <- read_tsv("phasing/phased_genos_run.txt")
#polarize to winter (find the most common allele in winter convert to 1, then convert alternative allele to 0)
#gather the alleles for winter
phased_genos %>%
filter(run == "Winter") %>%
count(Chr28_11667578, Omy_GREB1_09, Chr28_11607954, Omy_GREB1_05, Chr28_11625241, Chr28_11632591, Chr28_11658853, `Omy_RAD47080-54`, `Omy_RAD15709-53`, Chr28_11671116, Chr28_11676622, Chr28_11683204, Chr28_11773194, Chr28_11702210) %>%
slice(which.max(n))
#convert to polarized counts, just did this manually because there are only 14 of them and it was faster (sorry for the hardcoding later David...)
phased_genos <- phased_genos %>%
mutate(Chr28_11667578 = if_else(Chr28_11667578 == "C", 1, 0)) %>%
mutate(Omy_GREB1_09 = if_else(Omy_GREB1_09 == "G", 1, 0)) %>%
mutate(Chr28_11607954 = if_else(Chr28_11607954 == "G", 1, 0)) %>%
mutate(Omy_GREB1_05 = if_else(Omy_GREB1_05 == "G", 1, 0)) %>%
mutate(Chr28_11625241 = if_else(Chr28_11625241 == "G", 1, 0)) %>%
mutate(Chr28_11632591 = if_else(Chr28_11632591 == "G", 1, 0)) %>%
mutate(Chr28_11658853 = if_else(Chr28_11658853 == "C", 1, 0)) %>%
mutate(`Omy_RAD47080-54` = if_else(`Omy_RAD47080-54` == "G", 1, 0)) %>%
mutate(`Omy_RAD15709-53` = if_else(`Omy_RAD15709-53` == "G", 1, 0)) %>%
mutate(Chr28_11671116 = if_else(Chr28_11671116 == "T", 1, 0)) %>%
mutate(Chr28_11676622 = if_else(Chr28_11676622 == "G", 1, 0)) %>%
mutate(Chr28_11683204 = if_else(Chr28_11683204 == "T", 1, 0)) %>%
mutate(Chr28_11773194 = if_else(Chr28_11773194 == "A", 1, 0)) %>%
mutate(Chr28_11702210 = if_else(Chr28_11702210 == "G", 1, 0))
#lets get rid of the markers that are not diagnostic or heaviliy weighted in the a priori DAPC
phased_genos <- phased_genos %>%
select(-one_of(c("Chr28_11607954","Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194")))
#lets get rid of the "problem markers"
phased_genos <- phased_genos %>%
select(-one_of(c("Omy_RAD47080-54", "Chr28_11671116")))
winter_pac <- filter(phased_genos, run == "Winter")
summer_pac <- filter(phased_genos, run == "Summer")
fall_pac <- filter(phased_genos, run == "fall")
halfpounder_pac <- filter(phased_genos, run == "halfpounder")
a = pheatmap(winter_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "winter")
# b = pheatmap(summer_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "Summer") # this wont plot as there's no variation
c = pheatmap(fall_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "fall")
d = pheatmap(halfpounder_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "halfpounder")
#now without the uninformative markers (i.e. only keep diagnostic + greb05)
#skipped this and made better figures in the next code chunk
#
# phased_genos <- phased_genos %>%
# select(-one_of(c("Chr28_11607954","Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194")))
#
# winter_pac <- filter(phased_genos, run == "Winter")
# summer_pac <- filter(phased_genos, run == "Summer")
# fall_pac <- filter(phased_genos, run == "fall")
# halfpounder_pac <- filter(phased_genos, run == "halfpounder")
#
# #note that b won't run now because there is no variation
# a = pheatmap(winter_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "winter")
# #b = pheatmap(summer_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "summer")
# c = pheatmap(fall_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "fall")
# d = pheatmap(halfpounder_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "halfpounder")
Next let’s make some publication quality figures that more clearly express what is going on here. Goals: separate run types, within fall and half pounders separate into assignments groups, so that we can examine what types of haplotypes contribute to each, also within each plot make sure it is clustered
Plot below shows indiviudal haplotypes (from only informative loci) in the greb1L/rock1 region, haplotypes (rows) are hierarchically clustered, individual snps contributing to haplotypes (columns) are in genomic order, color is polarized so early summer is purple and winter is yellow.
#prepackaged heatmap tools are unlikely to work well given all the weird things we want to do, so lets just build our own ggplot output group by group (winter + summer + 3 each of fall and halfpounder)
#remove uninformative snps
#phased_genos <- phased_genos %>%
# select(-one_of(c("Chr28_11607954","Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194")))
#get assignment results
# first get the assignments
ld1_2 <- as.data.frame(dapc_full_apriori$ind.coord) %>%
rownames_to_column(var="sample") %>%
left_join(pops_info)
# assignments using original DAPC
CIs <- ld1_2 %>%
group_by(pop) %>%
summarise(loCI = quantile(LD1, probs = 0.025),
hiCI = quantile(LD1, probs = 0.975))
#number of half pounders that fall in the 95% credible interval for winter fish assignment
assn <- ld1_2 %>%
mutate(assignment = case_when(
LD1 < CIs$hiCI[4] ~ "winter",
LD1 > CIs$loCI[3] ~ "summer",
LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4] ~ "unassigned")) %>%
select(sample, assignment)
#split the halfpounders and fall into groups by assignment
half_fall_ph <- phased_genos %>%
filter(run == "halfpounder" | run == "fall") %>%
left_join(assn)
fall_w <- half_fall_ph %>%
filter(run == "fall" & assignment == "winter")
fall_s <- half_fall_ph %>%
filter(run == "fall" & assignment == "summer")
fall_u <- half_fall_ph %>%
filter(run == "fall" & assignment == "unassigned")
halfpounder_w <- half_fall_ph %>%
filter(run == "halfpounder" & assignment == "winter")
halfpounder_s <- half_fall_ph %>%
filter(run == "halfpounder" & assignment == "summer")
halfpounder_u <- half_fall_ph %>%
filter(run == "halfpounder" & assignment == "unassigned")
winter_pac <- phased_genos %>%
filter(run == "Winter")
summer_pac <- phased_genos %>%
filter(run == "Summer")
# get by group clustering results
require(ggdendro)
dendro_winter <- as.dendrogram(hclust(d = dist(winter_pac[,-c(1,2)])))
dendro_summer <- as.dendrogram(hclust(d = dist(summer_pac[,-c(1,2)])))
dendro_fall_w <- as.dendrogram(hclust(d = dist(fall_w[,-c(1,2,11)])))
dendro_fall_s <- as.dendrogram(hclust(d = dist(fall_s[,-c(1,2,11)])))
dendro_fall_u <- as.dendrogram(hclust(d = dist(fall_u[,-c(1,2,11)])))
dendro_halfpounder_w <- as.dendrogram(hclust(d = dist(halfpounder_w[,-c(1,2,11)])))
dendro_halfpounder_s <- as.dendrogram(hclust(d = dist(halfpounder_s[,-c(1,2,11)])))
dendro_halfpounder_u <- as.dendrogram(hclust(d = dist(halfpounder_u[,-c(1,2,11)])))
#add row ids for ordering later
winter_pac <- rowid_to_column(winter_pac, "row_id")
summer_pac <- rowid_to_column(summer_pac, "row_id")
fall_w <- rowid_to_column(fall_w, "row_id")
fall_s <- rowid_to_column(fall_s, "row_id")
fall_u <- rowid_to_column(fall_u, "row_id")
halfpounder_w <- rowid_to_column(halfpounder_w, "row_id")
halfpounder_s <- rowid_to_column(halfpounder_s, "row_id")
halfpounder_u <- rowid_to_column(halfpounder_u, "row_id")
#############
# convert to long format, retain dendrogram order and snp order
##############
long_winter <- winter_pac %>%
pivot_longer(cols = !(c("run", "sample", "row_id")) ,names_to = "snp", values_to = "allele_count") %>%
mutate(row_id = factor(row_id, levels=unique(winter_pac$row_id[order.dendrogram(dendro_winter)]), ordered = TRUE)) %>%
mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))
long_summer <- summer_pac %>%
pivot_longer(cols = !(c("run", "sample", "row_id")) ,names_to = "snp", values_to = "allele_count") %>%
mutate(row_id = factor(row_id, levels=unique(summer_pac$row_id[order.dendrogram(dendro_summer)]), ordered = TRUE)) %>%
mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))
long_fall_w <- fall_w %>%
pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
mutate(row_id = factor(row_id, levels=unique(fall_w$row_id[order.dendrogram(dendro_fall_w)]), ordered = TRUE)) %>%
mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))
long_fall_s <- fall_s %>%
pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
mutate(row_id = factor(row_id, levels=unique(fall_s$row_id[order.dendrogram(dendro_fall_s)]), ordered = TRUE)) %>%
mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))
long_fall_u <- fall_u %>%
pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
mutate(row_id = factor(row_id, levels=unique(fall_u$row_id[order.dendrogram(dendro_fall_u)]), ordered = TRUE)) %>%
mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))
long_half_u <- halfpounder_u %>%
pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
mutate(row_id = factor(row_id, levels=unique(halfpounder_u$row_id[order.dendrogram(dendro_halfpounder_u)]), ordered = TRUE)) %>%
mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))
long_half_w <- halfpounder_w %>%
pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
mutate(row_id = factor(row_id, levels=unique(halfpounder_w$row_id[order.dendrogram(dendro_halfpounder_w)]), ordered = TRUE)) %>%
mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))
long_half_s <- halfpounder_s %>%
pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
mutate(row_id = factor(row_id, levels=unique(halfpounder_s$row_id[order.dendrogram(dendro_halfpounder_s)]), ordered = TRUE)) %>%
mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))
############
#build heatmaps
############
heatmap_w <- ggplot(data = filter(long_winter), aes(x = snp, y = row_id)) +
geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Winter\n ")+xlab(element_blank())
heatmap_s <- ggplot(data = filter(long_summer), aes(x = snp, y = row_id)) +
geom_tile(aes(fill = allele_count))+theme_classic()+scale_fill_gradient(low = "#440154FF", high = "#440154FF")+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Early-Summer\n ")+xlab(element_blank())
heatmap_fw <- ggplot(data = filter(long_fall_w), aes(x = snp, y = row_id)) +
geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Late-Summer\nWinter Assigned")+xlab(element_blank())
heatmap_fs <- ggplot(data = filter(long_fall_s), aes(x = snp, y = row_id)) +
geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Late-Summer\nEarly-Summer ssigned")+xlab(element_blank())
heatmap_fu <- ggplot(data = filter(long_fall_u), aes(x = snp, y = row_id)) +
geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Late-Summer\nUnassigned")+xlab(element_blank())
heatmap_hw <- ggplot(data = filter(long_half_w), aes(x = snp, y = row_id)) +
geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Halfpounder\nWinter Assigned")+xlab(element_blank())
heatmap_hs <- ggplot(data = filter(long_half_s), aes(x = snp, y = row_id)) +
geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Halfpounder\nSummer Assigned")+xlab(element_blank())
heatmap_hu <- ggplot(data = filter(long_half_u), aes(x = snp, y = row_id)) +
geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme(axis.text.x = element_text(angle = 90), axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none")+ylab("Halfpounder\nUnassigned")
#make the final plot
require(gridExtra)
grid.arrange(heatmap_s, heatmap_w, heatmap_fs, heatmap_fw, heatmap_fu, heatmap_hs, heatmap_hw, heatmap_hu, ncol = 1, heights = c(3,3,3,3,3,3,3,7), clip = FALSE)
Let’s also build a haplotype network only for the informative markers.
This involves removing the uninformative markers from the .phylip file and running popart again (See above)
Let’s also collect the same stats again
#here to remove the two unwanted markers deleted the rows in a texteditor from phased_genos3_r.txt then saved as phased_genos4_r.txt
#ph_geno <- read_tsv("phasing/phased_genos3_r.txt")
ph_geno <- read_tsv("phasing/phased_genos4_r.txt")
ph_geno <- genos_2.3 %>%
select(Sample, run) %>%
left_join(ph_geno, by = c("Sample" = "ind"))
ph_geno %>%
group_by(run) %>%
summarise(count = n_distinct(hap))
#did the same as above (removed the bad columns) from phased_genos3.phylip and ran again
#ran popart on phased_genos4.phylip and traits2.txt
knitr::include_graphics('phasing/phased_genos4.phylip.svg')
The density of markers in greb1L/rock1 region in our GTseq dataset allows us to identify characteristic winter and early-summer run haplotypes from our samples. if we focus on only the sites that vary strongly between winter and summer runs, there is a single summer haplotype (note there are multiple haps within this group when considering all sites, not just the diagnostic + heavily weighted ones), and a handful (5?) haplotypes within winter run.
summer and winter assigned halfpounder and fall run fish possess largely winter and summer run haplotypes. unassigned halfpounders and fall run fish also possess 1 copy each of winter and summer run haplotypes, suggesting many unassigned fish are first generation hybrids between winter and summer runs, however, there are also many highly recombined haplotypes among the unassigned halfpounders and fall run fish. evidence of recombination within the greb1L/rock1 region haplotypes suggests gene flow between winter and summer run fish occurs beyond first generation hybrids. however we did not observe any recombined greb1L/rock1 region haplotypes within our sample of winter and summer run fish.
this opens up a question, what spawning events allow for recombination? we propose that there may be partial temporal isolation of early-summer from winter run fish, with fall run adults serving as a bridge between the extreme distribution of run timing. indeed we also discovered a weak but significant association between DAPC DA1 score and arrival timing among fall run fish.
this absence of recombined haplotypes within winter and summer samples may reflect extreme phenotype sampling (adult samples were identified as winter and summer runs by timing of arrival at spawning grounds, taken on far ends of distribution (late winters vs early early-summers))
this also calls into question what phenotype we are looking at, runs are named based on their timing of freshwater entry, but appear to also segregate at least weakly on the basis of spawning time/arrival at spawning grounds.
to continue thinking about differences between halfpounder and fall haplotype results?
where do halfpounders come in here?
Two multivariate approaches (unconstrained ordination (PCA) and constrained ordination (DAPC)) as well as a model-based bayesian clustering approach (STRUCTURE) were used to interrogate population genetic structure among 3 runs of adult Rogue River steelhead as well as half-pounders. We also estimated genetic distances among run types using Fst. These approaches attempt to identify structure either (a) among a priori assigned grouping of organisms (in our case run timing), or (b) among de novo identified genetic clusters. We draw inferences on population genetic structure within the Rogue River based on the strong agreement among multivariate and bayesian as well as de novo and a priori approaches to characterizing population genetic structure. However, it is important to note that our results are based on the small portion of total genetic variation represented by our GTseq panel markers, and that these markers do not represent an unbiased representation of genetic variation among Rogue River steelhead. For example, “neutral” markers in our panel were chosen to capture genetic variation at wide spatial scales and not within individual drainages, so our study may underestimate the extent of population differentiation at neutral markers. Similarly, the inclusion of multiple markers associated with run-timing phenotypes allows us to capture strong differentiation at these genomic regions, but these regions are strong outliers across many population genetic parameters, are likely over represented in our dataset relative to neutral markers and generally will overestimate genome-wide genetic differentiation. Most critically, our dataset may exclude genetic variants that drive adaptive phenotypic variation within the Rogue River. As many of our approaches utilize different subsets of markers, and differ in their sensitivity to various biases in the dataset, it is important to consider each in light of its respective caveats and underlying assumptions.
No Difference Among Late-Summer Adults and Half-Pounders
While there is evidence of structure within half pounders and late-summer run steelhead (see discussion below), we did not find any evidence to suggest restricted gene flow among these two groups. Genetic differentiation (Fst) was low at both neutral markers and on average across the full GTseq dataset and there is little distance among group centroids in both the neutral and full PCA. STRUCTURE results suggest that half-pounders and late summer run fish derive similar proportions of their ancestry from different ancestral genetic clusters, regardless of the number of genetic clusters being modeled. Additionally, individuals within each group demonstrate similar variance in this metric. We also failed to discriminate among half-pounder and late summer run fish using a discriminant analysis of principal components of genetic variation in our dataset.
Structure Among Winter and Early-Summer Run Steelhead
In contrast, we identify population genetic structure among winter and early-summer run steelhead at both neutral and adaptive genetic markers. Using markers annotated as neutral in our GTseq dataset, total differentiation is low (Fst ~ 1%), and there is little to no differentiation apparent in the PCA, however, this low level of differentiation is sufficient to clearly discriminate among winter and summer run fish using DAPC.
When using the full dataset, structure among winter and early summer run fish is strong. When 3 or more ancestral genetic clusters are modeled, STRUCTURE estimates alternative genetic clusters that dominate the ancestry of winter and early summer run steelhead, although all individuals demonstrate some degree of admixture with the alternative population’s dominant cluster. Using DAPC, we are able to describe a single axis of genetic variation that strongly discriminate among winter and early-summer run fish. This axis is primarily driven by variation at markers with known associations to run-timing that are diagnostic or nearly diagnostic among winter and early-summer run steelehad, and, to a lesser extent, several neutral annotated markers and markers with known associations to residency vs anadromy.
Structure within Late-Summer and Half-Pounders
The matter of separating late-summer and half-pounder steelhead from early-summer and winter runs is less clear. While all analyses that identified structure within the Rogue River steelhead place half-pounder and late-summer run steelhead as intermediate between winter and early-summer steelhead at the group-level, there is (a) substantial overlap in genetic variation among both half-pounder and fall run steelhead with winter and early summer run steelhead and (b) evidence of population structure within half-pounder and late-summer run steelhead.
Inference of structure within both fall and half-pounder fish comes from several lines of evidence. Among half-pounders, and to a lesser extent late-summer run steelhead, many markers were out of HWE with a significant excess of homozygosity, suggestive of Wahlund effects. These markers were significantly enriched for run-timing associated markers, suggesting that structure within half-pounders and fall run fish is due in part to groups of either early-summer-like or winter-like individuals. Also, de novo identification of population genetic structure via k means clustering always separated late-summer and half-pounder individuals into genetic clusters with both late-summer and winter run adults, regardless of the number of clusters used. Finally, unlike k-means clustering, in STRUCTURE, individuals may derive their ancestry from multiple clusters, and we see that while there are no strong differences at the group level in ancestry, there is a high level of variation within each group.
Indeed, when we attempt to classify both half-pounder and late-summer run fish as winter-like or early-summer like using a linear combination of alleles that strongly discriminate winter and early summer run fish (discriminant axis 1), we observe a group of strongly winter-like fish, a second group of intermeditate fish and a small number of early-summer-like fish. Direct examination of genotypes at known-run timing markers reveals the same pattern. One caveat here that might be worth including in anything published for a wider audience: Methods like the DAPC, and even the association analyses that are the basis of the annotations for our panel, operate under an assumption of additivity. Without a clear, mechaninistic understanding of the g -> p map for run timing, we can not predict run-timing phenotypes from such heterozygous and discordant genotypes at run-timing associated markers. Variation in effect size among loci, dominance, epistasis, the effects of un-genotyped causal loci and plasticity all may be at play here.
add discussion about diagnostic vs highly weighted in DAPC and about difference in phenotypes used for annotation (CRITFC annotations are for interior pops and the phenotype is arrival timing on spawning grounds, whereas prince et al annotations are for freshwater entry) compare our resutls with those from the latest critfc paper and the coastal lineages
also add discussion about how some markers with known annotations are not diagnostic or not even informative, and this could have somehting to do with divergence in the summer haplotype as it spread (assuming a single origin)
In addition to structure within fall and half-pounders along the genetic axis differentiating early-summer and winter runs, we also identified structure within these groups driven by both neutral markers and adaptive markers, including all 5 markers in the dataset with residency-vs-anadromy annotations and a neutral annotated marker with strong residency-vs-anaadromy associations in the Klickitat River.
Taken together, we propose that both late-summer run and half-pounders are a mixed assemblage of individuals demonstrating genetic characteristics inclusive of winter-run, early-summer run and a highly heterozygous intermediate group. Also, despite mark-recapture evidence that early-summer and late-summer run adults are found together on the spawning grounds, genetic differentiation at GTseq panel markers suggests late-summer (and half-pounder) run adults are less differentiated from winter run than early-summer run adults at both neutral and adaptive regions of the genome.
This section is a dumping ground for ideas if we want to move forward with this investigation, and stray thoughts that might warrant returning to in the future. Not edited, remove before sharing widely.
Intermediate freq Ch28
How is it possible to have intermediate MAFs within the Ch28 region at a small subset of markers, when the rest are nearly fixed within the region within winter/early summer runs? These “discordant” genotypes suggests either (a) multiple haplotypes (more than two) segregating within the region, (b) two (or more) separate haplotype blocks with the “RoSA” that are not coinherited (i.e. wildly high recombination in a region that seems to have low recombination in the rest of the species…unlikely), (c) extremely strong balancing selection at the genomic region level (also would be a wild thing to observe) or (d) major genotyping error (Need to go in and check out the allele correction values / genotyping plot for these markers: Chr28_11607954, Chr28_11632591, Ch28_11773194, Omy_GREB1_09)
Should also think about whether or not the sequencing data we have might be used to call statistically phased haplotypes for this region (it’s pretty densely genotyped). Conducting an analysis with information with the provided by haplotypes can probably get to the bottom of this, but should think more seriously about what it will provide before investing the time to figure it out.
Is an extremely recent duplication even another possible explanation for this weird pattern that we see.
Neutral structure outliers (strays) in halfpounders
There is evidence that half pounders may return to Rogue despite being native to a different stream, then return to the their natal stream during their subsequent adult spawning run(s). Do we see evidence of any half-pounders from other streams (i.e. outlier fish at neutral loci). This behavior is also predicted as a consequence of the proposed explanation of the half pounder lifestyle as a behavioral adaptation to avoid a recurring region of high SSTs near the Klamath and Rogue basins
Quote from Eversest 1971: Additional evidence of this behavior was noted when returns from “half-pounders” tagged in 1968 were analyzed. Thirty-one fish were tagged in one seine-haul on August 14, 1968, and five were recaptured in the Rogue the sane year. On subsequent migrations in 1969 and 1970, only one of the fish was recaptured in the Rogue while two were taken outside the system, one in the Smith River, and one from the Trinity River, both in northern California.
Evolutionary explanation for maintenance of half-pounders
The results as presented here leave us with something of a problem, if field studies suggest there is limited gene flow among early-summer and winter fish, then (i) why are fall run fish more closely related to winter fish, and (ii) how is the existence of the half pounder or fall run lifestyle maintained. Temporally balancing selection is generally invoked to maintain the existence of multiple life-histories in salmonids, but in this case we may have an example of true balancing selection (heterozygote advantage) as the half-pounder lifehistory (at least at the tiny number of markers considered here) seems to be driven by high heterozygosity at run timing associated markers. The same goes for fall run fish. If return to the rogue in late summer is driven by unfavorable sea surface conditions and a SST “trap,” it would follow that both half pounders and fall run life histories are behavioral adaptations under strong selection, and the alternative life-histories are less fit, but maintained as a short term (non-equilibrium) consequence of balancing selection. In other words, the SST trap leads to strong selection for a fall run and this has different impacts depending on the level of maturity for an individual: for juveniles a fall run results in a false spawingn run (half pounder), for adults, this results in a fall run. Individuals that fail to make the fall run (either as half pounders or mature adults) face selection. This would be a pretty shocking result, but I’m failing to find an alternative explanation. It definitely motivates further study either via a GWAS or a genome wide perspective of demography and selection within the Rogue. I’d be particularly interested in collecting haplotypic data to examine recent demographic histories and evidence of selection. At the very least looking for balancing selection using outlier approaches within fall run fish.
Proportions
do the assignment proportions in half pounder and fall run fish match with the run counts for the appropriately matched year?
pattern within fall
we have dates for approximate freshwater entry for the fall run fish. is there a correlation in the winter-vs-summer genetic axis and date of entry? there is substantial overlap in the entry timing of fish referred to as winter and late-summer run, and substantial interannual variation in the peak of run timing within a group, taken together this suggests that some late late-summer fish could instead be early-arriving winter run fish. this would show up as a correlation between run timing genotype and arrival date with the fall fish. let’s check real quick
#intake files
half_2018_intake <- readxl::read_xlsx("../meta_data/2018_halfpounder/OmyJC18ROGR_STHP Intake form Spread sheet.xlsx", sheet = 3)
half_2019_intake <- readxl::read_xlsx("../meta_data/2019_halfpounder/STHP Intake form Spread sheet 2019.xlsx", sheet = 1)
fall_intake <- readxl::read_xlsx("../meta_data/2019_fall/Rogue Adult Summer and Winter (06 11 20).xlsx", sheet = 1)
summer_intake <- readxl::read_xlsx("../meta_data/2019_summer/Omy Rogue2019 steelhead datasheets.xlsx", sheet = 2)
winter_intake <- readxl::read_xlsx("../meta_data/2019_winter/StW Scales for DNA (06 17 19).xlsx", sheet = 3)
#merge intakes
# first clean them up a bit to make merging easier
half_2018_intake <- half_2018_intake[,c(1,6)]
colnames(half_2018_intake)[1] <- "ID"
half_2018_intake$run <- "halfpounder"
half_2018_intake$year <- "2018"
colnames(half_2018_intake) <- c("ID", "Date", "run", "year")
half_2019_intake <- half_2019_intake[,c(1,2)]
half_2019_intake$run <- "halfpounder"
half_2019_intake$year <- "2019"
colnames(half_2019_intake) <- c("ID", "Date", "run", "year")
summer_intake <- summer_intake[,c(2,3,6)]
colnames(summer_intake) <- c("ID", "Date", "run")
summer_intake$year <- "2019"
winter_intake <- winter_intake[,c(2,3,7)]
colnames(winter_intake) <- c("ID", "Date", "run")
winter_intake$year <- "2019"
fall_intake <- subset(fall_intake, Run == "Summer")
fall_intake <- fall_intake[,c(1,3,6)]
colnames(fall_intake) <- c("ID", "Date", "run")
fall_intake$run <- "fall"
fall_intake$year <- "2019"
meta_data <- bind_rows(half_2018_intake, half_2019_intake, fall_intake, winter_intake, summer_intake)
cor_data <- meta_data %>%
right_join(ld1_2, by = c("ID" = "sample" ))
ggplot(cor_data[cor_data$run=="fall",])+geom_point(aes(Date,LD1))+theme_classic()+geom_smooth(aes(Date,LD1), method = "lm")
cor_data$jdate <- format(as.POSIXlt(cor_data$Date, format = "%y-%m-%d"), "%j")
summary(lm(LD1~as.numeric(jdate), data=cor_data[cor_data$run=="fall",]))
##
## Call:
## lm(formula = LD1 ~ as.numeric(jdate), data = cor_data[cor_data$run ==
## "fall", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.10257 -0.74800 0.01029 0.66872 2.62669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.520251 1.268902 1.986 0.0488 *
## as.numeric(jdate) -0.010246 0.005015 -2.043 0.0427 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9074 on 155 degrees of freedom
## Multiple R-squared: 0.02623, Adjusted R-squared: 0.01994
## F-statistic: 4.174 on 1 and 155 DF, p-value: 0.04273
summary(lm(LD1~as.numeric(jdate), data=cor_data[cor_data$run=="halfpounder",]))
##
## Call:
## lm(formula = LD1 ~ as.numeric(jdate), data = cor_data[cor_data$run ==
## "halfpounder", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2536 -0.8270 -0.2535 0.7510 3.2339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.743636 0.835147 2.088 0.0372 *
## as.numeric(jdate) -0.007305 0.003335 -2.191 0.0288 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.066 on 641 degrees of freedom
## Multiple R-squared: 0.00743, Adjusted R-squared: 0.005882
## F-statistic: 4.798 on 1 and 641 DF, p-value: 0.02885
ggplot(cor_data[cor_data$run=="halfpounder",])+geom_point(aes(jdate,LD1))+theme_classic()+geom_smooth(aes(jdate,LD1), method = "lm")
cor_data %>% group_by(run) %>%
summarise(r = range(jdate))
There is a significant, but minor trend (r2 = 0.03, p = 0.04) towards more genetically winter-like fish over time among fall run fish.
also look within halfpounders
let’s check to make sure there are no interesting patterns / structure among halfpounder years
# create the dataset
half_genind <- seppop(genind_2.1)$halfpounder
years <- str_sub(row.names(half_genind@tab), 6,7)
half_genind$pop <- as.factor(years)
# optimized number of pcs with optim.a.score, smart = F
#half_dapc_year <- dapc(half_genind, n.pca = 200, n.da =1)
#optim.a.score(half_dapc_year, smart = FALSE)
half_dapc_year <- dapc(half_genind, n.pca = 31, n.da =1)
scatter.dapc(half_dapc_year)
Even with DAPC, very little structure year to year in halfpounders.
#merge polarized allele count data with sample site, population year sample date sample location genotyped based on the 9 loci in bold-face type origin
# first get the assignments
ld1_2 <- as.data.frame(dapc_full_apriori$ind.coord) %>%
rownames_to_column(var="sample") %>%
left_join(pops_info)
# assignments using original DAPC
CIs <- ld1_2 %>%
group_by(pop) %>%
summarise(loCI = quantile(LD1, probs = 0.025),
hiCI = quantile(LD1, probs = 0.975))
#number of half pounders that fall in the 95% credible interval for winter fish assignment
assn <- ld1_2 %>%
mutate(assignment = case_when(
LD1 < CIs$hiCI[4] ~ "winter",
LD1 > CIs$loCI[3] ~ "summer",
LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4] ~ "unassigned")) %>%
select(sample, assignment)
#get hor/nor
md <- readxl::read_xlsx("./metadata/Omy Rogue2019 steelhead datasheets.xlsx", sheet = 4)
md <- md %>%
mutate(Origin = case_when(
Origin == "AD" ~ "HOR",
Origin == "1" ~ "NOR",
TRUE ~ Origin
))
to_exp <- polarized_allele_counts %>%
rownames_to_column(var = "sample") %>%
left_join(assn) %>%
left_join(select(genos_2.3, Sample, Date, run , year), by = c("sample" = "Sample")) %>%
left_join(select(md, SFGL_ID, Origin), by = c("sample" = "SFGL_ID"))
write_tsv(to_exp, "genotype_data/chr28_polarized_allele_counts_2.txt")